


INTELLIGENT SYSTEMS REFERENCE LIBRARY
Volume 21




Igor Chikalov 

Average Time Complexity of Decision Trees 



Intelligent Systems Reference Library, Volume 21 
Editors-in-Chief 



Prof. Janusz Kacprzyk 

Systems Research Institute 

Polish Academy of Sciences 

ul. Newelska 6

01-447 Warsaw 

Poland 

E-mail: kacprzyk@ibspan.waw.pl 



Prof. Lakhmi C. Jain 

University of South Australia 

Adelaide 

Mawson Lakes Campus 

South Australia 5095 

Australia 

E-mail: Lakhmi.jain@unisa.edu.au 



Further volumes of this series can be found 
on our homepage: springer.com 



Vol. 5. George A. Anastassiou
Intelligent Mathematics: Computational Analysis, 2010
ISBN 978-3-642-17097-3

Vol. 6. Ludmila Dymowa
Soft Computing in Economics and Finance, 2011
ISBN 978-3-642-17718-7

Vol. 7. Gerasimos G. Rigatos
Modelling and Control for Intelligent Industrial Systems, 2011
ISBN 978-3-642-17874-0

Vol. 8. Edward H.Y. Lim, James N.K. Liu, and Raymond S.T. Lee
Knowledge Seeker - Ontology Modelling for Information Search and
Management, 2011
ISBN 978-3-642-17915-0

Vol. 9. Menahem Friedman and Abraham Kandel
Calculus Light, 2011
ISBN 978-3-642-17847-4

Vol. 10. Andreas Tolk and Lakhmi C. Jain
Intelligence-Based Systems Engineering, 2011
ISBN 978-3-642-17930-3

Vol. 11. Samuli Niiranen and Andre Ribeiro (Eds.)
Information Processing and Biological Systems, 2011
ISBN 978-3-642-19620-1

Vol. 12. Florin Gorunescu
Data Mining, 2011
ISBN 978-3-642-19720-8

Vol. 13. Witold Pedrycz and Shyi-Ming Chen (Eds.)
Granular Computing and Intelligent Systems, 2011
ISBN 978-3-642-19819-9

Vol. 14. George A. Anastassiou and Oktay Duman
Towards Intelligent Modeling: Statistical Approximation Theory, 2011
ISBN 978-3-642-19825-0

Vol. 15. Antonino Freno and Edmondo Trentin
Hybrid Random Fields, 2011
ISBN 978-3-642-20307-7

Vol. 16. Alexiei Dingli
Knowledge Annotation: Making Implicit Knowledge Explicit, 2011
ISBN 978-3-642-20322-0

Vol. 17. Crina Grosan and Ajith Abraham
Intelligent Systems, 2011
ISBN 978-3-642-21003-7

Vol. 18. Achim Zielesny
From Curve Fitting to Machine Learning, 2011
ISBN 978-3-642-21279-6

Vol. 19. George A. Anastassiou
Intelligent Systems: Approximation by Artificial Neural Networks, 2011
ISBN 978-3-642-21430-1

Vol. 20. Lech Polkowski
Approximate Reasoning by Parts, 2011
ISBN 978-3-642-22278-8

Vol. 21. Igor Chikalov
Average Time Complexity of Decision Trees, 2011
ISBN 978-3-642-22660-1



Igor Chikalov 



Average Time Complexity 
of Decision Trees 



Springer



Dr. Igor Chikalov 

Mathematical and Computer Sciences 

and Engineering Division 

4700 King Abdullah University of Science 

and Technology 

Thuwal 23955-6900 

Kingdom of Saudi Arabia 

E-mail: igor.chikalov@kaust.edu.sa 



ISBN 978-3-642-22660-1 e-ISBN 978-3-642-22661-8 

DOI 10.1007/978-3-642-22661-8 

Intelligent Systems Reference Library ISSN 1868-4394 

Library of Congress Control Number: 2011932437 
© 2011 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or 
part of the material is concerned, specifically the rights of translation, reprinting, 
reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in 
any other way, and storage in data banks. Duplication of this publication or 
parts thereof is permitted only under the provisions of the German Copyright 
Law of September 9, 1965, in its current version, and permission for use must 
always be obtained from Springer. Violations are liable to prosecution under the 
German Copyright Law. 

The use of general descriptive names, registered names, trademarks, etc. in this 
publication does not imply, even in the absence of a specific statement, that such 
names are exempt from the relevant protective laws and regulations and therefore 
free for general use. 

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India. 

Printed on acid-free paper 


springer.com 



To my wife Julia and our daughter 
Svetlana, for always believing in me and 
loving me, no matter what. 



Foreword 



It is our great pleasure to welcome a new book "Average Time Complexity of 
Decision Trees" by Igor Chikalov. This book is devoted to the study of average 
time complexity (average depth and weighted average depth) of decision trees 
over finite and infinite sets of attributes. It contains exact and approximate 
algorithms for decision tree optimization, and bounds on minimum average 
time complexity of decision trees. The average time complexity measures can 
be used in searching for the minimum description length of induced data 
models. Hence, there exist relationships of the presented results with the 
minimum description length principle (MDL). 

The considered applications include the study of average depth of decision 
trees for Boolean functions from closed classes, the comparison of results of 
the performance of greedy heuristics for average depth minimization with 
optimal decision trees constructed by dynamic programming algorithm, and 
optimization of decision trees for the corner point recognition problem from 
computer vision. 

The book can be interesting for researchers working on time complexity 
of algorithms and specialists in machine learning. 

The author, Igor Chikalov, received his PhD degree in 2002 from Nizhny
Novgorod State University, Russia. For nine years he worked for
Intel Corp. as a senior software engineer/research scientist on machine
learning applications to the control and diagnostic problems of
semiconductor manufacturing. Since 2009 he has been a senior research
scientist at King Abdullah University of Science and Technology, Saudi
Arabia. His current research interests include supervised machine learning
and extensions of dynamic programming to the optimization of decision
trees and decision rules.

The author deserves the highest appreciation for his outstanding work. 



Mikhail Moshkov 
May 2011 Andrzej Skowron 



Preface 



The monograph is devoted to the theoretical and experimental study of
decision trees with a focus on minimizing the average time complexity. The
study resulted in upper and lower bounds on the minimum average time
complexity of decision trees for identification problems. Previously known
bounds from information theory are extended to the case of an
identification problem with an arbitrary set of attributes. Some examples
of identification problems are presented that give evidence that the
obtained bounds are close to unimprovable. In addition to universal bounds,
we study the effectiveness of representing several types of discrete
functions in the form of decision trees. In particular, for each closed
class of Boolean functions we obtained upper bounds on the average depth of
decision trees implementing functions from this class.

The monograph also studies the problem of algorithm design for optimal 
decision tree construction. An algorithm based on dynamic programming 
is proposed that describes a set of optimal trees and allows for subsequent 
optimization on other criteria. Experimental results show applicability of the 
algorithm to real-life applications that are represented by decision tables 
containing dozens of attributes and several thousands of objects. 

Besides individual identification problems, infinite classes of problems
are considered. The monograph describes necessary conditions on such
classes in order to have polynomial complexity algorithms for optimal
decision tree construction.

The presented results can be of interest to researchers in test theory,
rough set theory and machine learning. Some results may be considered for
inclusion in graduate courses on discrete mathematics and computer science.
The monograph can be used as a reference to prior results in the area.

Some results were obtained in collaboration with Dr. Mikhail Moshkov 
and published in joint papers [5T1 [SH [531 IMl HSj- I am heartily thankful to 
Dr. Moshkov for help in preparing this book. 




I would like to acknowledge and extend my gratitude to Victor Eruhimov
for fruitful discussions about applications of decision trees and Dr. Andrzej 
Skowron for constructive criticism and suggestions for improvement of the 
book. 



Thuwal, Saudi Arabia, 

April 2011 Igor Chikalov

Chapter 1
Introduction



Decision trees appeared in the 1950s and 1960s in theoretical computer
science jTH EH [SD] and applications PU [37]. Similar objects are also
considered in the natural and social sciences, for example, taxonomy keys
[30] or questionnaires [S3!. Decision trees naturally represent
identification and testing algorithms that specify the next test to perform
based on the results of the previous tests. A number of particular
formulations were generalized by Garey [22] as the identification problem,
that is, the problem of distinguishing objects described by a common set of
attributes. A more general formulation is provided by the decision table
framework [31] IM|, where objects can have an incomplete set of attributes
and non-unique class labels. In that case, acquiring the class label is
enough to solve the problem: identifying a particular object is not
required. In this context, decision trees have found many applications in
test theory [33 ESI ill [H], fault diagnosis dHEOKIlj, rough set theory
[SI1E2], discrete optimization, non-procedural programming languages [34],
analysis of algorithm complexity [38], computer vision [745, and
computational geometry m-.

A decision tree is also a way of representing data in a structured
hierarchical manner. It describes a recursive partitioning of a set of
objects into groups according to the attribute values. Such a
representation reveals various patterns in data, like object similarity and
common characteristics of several objects. If objects are divided into
classes, a decision tree gives an idea of which attributes are important
for assigning an object to a certain class. In machine learning problems,
decision trees show an ability to generalize, that is, to capture strong
dependencies only and to ignore the weak ones that result from a finite
sample size and do not reflect properties of the data source [S] [7T].
Compact decision trees are easily interpreted by human experts, which makes
them favorable over other models. State-of-the-art statistical modeling
techniques like tree ensembles [T] [55] use decision trees for their
insensitivity to outliers and their uniform way of dealing with numeric and
categorical (discrete unordered) attributes.

In most cases, multiple decision trees are available for the same problem.
Not all of them are equally favorable. Depending on the application, a tree
is required to have minimal storage complexity, or guaranteed time
complexity in all cases, or a minimal expected number of tests. This leads
to different strategies for building decision trees. Bounds on the minimum
tree complexity and algorithms for optimal tree construction are studied in
test theory dH iS US 113 HU, rough set theory [S3 [SI 113, search theory
[1], and machine learning ^31 [7T]. It was discovered that for almost all
criteria the problem of building an optimal decision tree is NP-hard, and
for many cases there are results preserving a polynomial time
approximation. The problem of designing effective algorithms for building
decision trees is still open, although recent advances proved that greedy
algorithms [T^ [35] build trees that are close to optimal in some cases.

In this monograph, several known results on the average time complexity of 
decision trees are generalized and a number of new problems are considered. 
The main goal is to obtain bounds on the minimum average time complexity 
of decision trees and design effective algorithms for building decision trees for 
some classes of information systems. Methods of combinatorics, probability 
theory and complexity theory are used in the proofs as well as concepts from 
various branches of discrete mathematics and computer science. 

The monograph consists of five chapters. Chapter 2 considers bounds on the
minimum average weighted depth of decision trees. Upper and lower bounds on
the average time complexity of decision trees were known previously for a
problem with a complete set of attributes. These bounds, which depend only
on the entropy of the probability distribution, follow from results of
coding theory [HI [77] and are widely applied in search theory (see, e.g.
pj). Chapter 2 generalizes these bounds to the case of the average weighted
depth of decision trees for an arbitrary identification problem. In the
first section, an upper bound on the average weighted depth of decision
trees and a more precise bound on the average depth are proved. These
bounds depend on the entropy and a parameter $M(z)$, which was introduced
by Moshkov in [46]. An analogous parameter of the exact learning problem is
called the extended teaching dimension |H[33]. In the general case,
calculating $M(z)$ is computationally intractable, but for several classes
of problems either the exact value or tight bounds on $M(z)$ can be
obtained by theoretical analysis.

The second section describes conditions on the problem structure and the
probability distribution for objects that enable problem decomposition.
Under these conditions an optimal decision tree for the initial problem can
be synthesized from optimal decision trees for simpler problems. This
technique is used to build a class of problems for which the minimum
average depth of a decision tree is close to its upper bound given in the
first section.

Chapter 3 is devoted to several applications that can effectively use
decision trees. It consists of two sections. The first section studies the
average depth of decision trees implementing Boolean functions. A
Shannon-type function is considered that describes the growth of the
average depth of decision trees as the number of arguments of the
implemented functions grows. For each closed class of Boolean functions
[BSl I3S]), a lower and an upper bound are obtained on the Shannon-type
function characterizing this class. The obtained results are compared to
the analogous results for the depth of decision trees described in [1H|.
The notion of decision table partition used in the proofs is similar to the
notion of a system of nonoverlapping coverings of the Boolean cube used in
spectral methods of digital logic [6^, but the type of covering is
estimated from the parent closed class of the function rather than from its
spectral properties. This makes it possible to improve lower bounds on the
average depth of decision trees for some functions (e.g., the voting
function and the logical sum). The second section shows that each branching
program with the minimum average weighted depth is a read-once branching
program. Due to this fact, known exponential lower bounds on the number of
nodes in read-once branching programs for several combinatorial problems
[S21 EH [H31 [HH El] are applicable to branching programs with the minimum
average weighted depth.

Chapter 4 is devoted to algorithms for decision tree construction. The
first section describes an algorithm $\mathcal{A}$ that builds a set of
decision trees with the minimum average weighted depth for a problem given
in the form of a decision table. The idea of the algorithm is based on
dynamic programming p7l l42l \Q0\ [76]. The second section describes
experimental results of using $\mathcal{A}$ for implementing Boolean
functions by decision trees. The third section is devoted to greedy
algorithms. It describes a general scheme of a greedy algorithm, defines
several data impurity functions, and describes results of a comparative
study of the performance of several greedy algorithms applied to data sets
from the UCI Machine Learning Repository [25]. The fourth section describes
results of applying $\mathcal{A}$ to a practical problem of computer
vision: fast detection of corner points [7S].

Chapter 5 considers a class of information systems called restricted
information systems. It consists of two sections. The first section proves
that for restricted information systems (and only for such systems) there
exist upper bounds on the average depth of decision trees that depend only
on the entropy of the object probability distribution. The second section
gives necessary and sufficient conditions under which the time complexity
of the algorithm $\mathcal{A}$ considered above is bounded from above by a
polynomial in the number of rows in a table. These conditions contain the
requirement for the information system to be restricted. In [1], the
average depth of decision trees is studied for some problems (e.g., the
problem of finding a leak in a pipeline). The obtained results generalize
bounds from [1] to an arbitrary restricted information system and give
polynomial algorithms for building optimal decision trees.

The monograph contains mainly theoretical results that can be used for the
design of effective algorithms for building decision trees and for the
analysis of the complexity of representing various objects by decision
trees. These results can be of interest to researchers in test theory,
rough set theory and logical analysis of data. The monograph can be used as
a part of a course for graduate students and Ph.D. studies.



1.1 Basic Notions 

Denote $\omega = \{0, 1, 2, \ldots\}$, and for $k \in \omega \setminus \{0, 1\}$,
denote $E_k = \{0, \ldots, k - 1\}$.

1.1.1 Information Systems 

Let $A$ be a nonempty set and $F$ a set of functions defined on $A$ and
taking values from $E_k$, such that for any $f \in F$ the condition
$f \not\equiv \mathrm{const}$ holds. The functions from $F$ are called
attributes, and the pair $U = (A, F)$ is called a $k$-valued information
system (or simply an information system).

A weight function for the information system $U$ is a function of the form
$\Psi : F \to \{1, 2, \ldots\}$ that assigns a weight $\Psi(f)$ to each
attribute $f \in F$.

1.1.2 Problems Over Information Systems 

A problem over the information system $U$ is defined by a tuple
$z = (\nu, f_1, \ldots, f_n)$, where $\nu : E_k^n \to \{0, 1, \ldots, k^n - 1\}$
and $f_1, \ldots, f_n \in F$. The problem $z$ consists in finding the value
$z(a) = \nu(f_1(a), \ldots, f_n(a))$ for an arbitrary element $a \in A$.

Two elements $a$ and $b$ from $A$ are equivalent for the problem $z$ if
$f_i(a) = f_i(b)$ for $i = 1, \ldots, n$. This equivalence relation defines
a partition of $A$ into nonempty equivalence classes $Q_1, \ldots, Q_s$.
Let us denote by $T_z$ the set
$\{\bar{\delta}_1, \ldots, \bar{\delta}_s\} \subseteq E_k^n$, where
$\bar{\delta}_i = (f_1(a), \ldots, f_n(a))$ and $a \in Q_i$,
$i = 1, \ldots, s$. A problem $z$ is called diagnostic if for any two tuples
$\bar{\delta}_i, \bar{\delta}_j \in T_z$, $\bar{\delta}_i \ne \bar{\delta}_j$,
the condition $\nu(\bar{\delta}_i) \ne \nu(\bar{\delta}_j)$ holds.

A probability distribution for the problem $z$ is a mapping
$P : T_z \to \omega \setminus \{0\}$. For $\bar{\delta} \in T_z$, the value
$P(\bar{\delta}) / \sum_{\bar{\sigma} \in T_z} P(\bar{\sigma})$ can be
interpreted as the probability of the event
$(f_1(a), \ldots, f_n(a)) = \bar{\delta}$ for an arbitrary element $a$ from
$A$.
1.1.3 Decision Trees 

A decision tree for the problem $z = (\nu, f_1, \ldots, f_n)$ is a finite
directed tree with a root in which:

• each nonterminal node is assigned an attribute from the set
$\{f_1, \ldots, f_n\}$ (i.e., decision trees use only the attributes listed
in the description of the problem $z$);

• each nonterminal node has exactly $k$ outgoing edges, which are labeled
with the numbers $0, \ldots, k - 1$ respectively;

• each terminal node is assigned a number from $\omega$.

Let us describe the algorithm represented by a decision tree. Let the input
be an element $a \in A$. First, the root is assigned to be the current node.
Let us describe one step of the algorithm. If the current node is terminal,
the algorithm returns as its result the number assigned to the current node
and finishes. Otherwise, let $f_i$ be the attribute assigned to the current
node. For $\delta = 0, \ldots, k - 1$, let $e_\delta$ be the edge that
leaves the current node and is labeled with $\delta$. The value $f_i(a)$ is
calculated, and the node that the edge $e_{f_i(a)}$ enters becomes the
current node. Then the algorithm proceeds to the next step.
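
The step-by-step procedure above can be sketched in Python. This is only an illustration under stated assumptions, not code from the monograph: the names `Leaf`, `Node`, `run`, and the example attributes `x1` and `x2`, are hypothetical and introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    label: int  # number assigned to a terminal node


@dataclass
class Node:
    attribute: Callable[[object], int]  # attribute f_i : A -> E_k
    children: Dict[int, "Tree"]         # edge label delta -> subtree


Tree = Union[Leaf, Node]


def run(tree, a):
    """Return (result, number of attributes computed) for the element a."""
    current, steps = tree, 0
    while isinstance(current, Node):
        value = current.attribute(a)       # compute f_i(a)
        current = current.children[value]  # follow the edge labeled f_i(a)
        steps += 1
    return current.label, steps


# Hypothetical example: A is the set of bit pairs, the attributes are the
# two coordinates, and the problem is z(a) = x1(a) AND x2(a).
x1 = lambda a: a[0]
x2 = lambda a: a[1]
and_tree = Node(x1, {0: Leaf(0), 1: Node(x2, {0: Leaf(0), 1: Leaf(1)})})
```

For the element (0, 1) the sketch stops after a single attribute, while (1, 1) requires both; this difference between inputs is what the average depth defined later measures.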

1.1.4 Decision Tables 

Let $U$ be a $k$-valued information system, $\Psi$ a weight function for
$U$, $z = (\nu, f_1, \ldots, f_n)$ a problem over $U$, and $P$ a probability
distribution for $z$. Let $T_z = \{\bar{\delta}_1, \ldots, \bar{\delta}_s\}$.
The set $T_z$ can be represented as a rectangular table filled with numbers
from $E_k$. Rows of the table correspond to the equivalence classes, columns
to the attributes, and each number is the value of the corresponding
attribute for all elements of the corresponding equivalence class. Let us
assign to the $i$-th column the weight of the attribute $f_i$ for
$i = 1, \ldots, n$, and assign to the row $\bar{\delta}_j$ the numbers
$\nu(\bar{\delta}_j)$ and $P(\bar{\delta}_j)$ for $j = 1, \ldots, s$. We
will denote the resulting table $T_z^*$ and call it the decision table for
the problem $z$. Several algorithms considered below take as input a tabular
representation of the problem $z$.

A two-player game can be associated with the table $T_z^*$. The first
player thinks of a row $\bar{\delta}$ from $T_z^*$. The goal of the second
player is to ascertain the number $\nu(\bar{\delta})$ assigned to the row
$\bar{\delta}$ in $T_z^*$. The second player is allowed to ask questions of
the following form: he can choose a column and ask which number is at the
intersection of the column and the row that the first player has in mind. A
strategy of the second player can be represented in the form of a decision
tree.

Denote $\Omega_z = \{(f_i, \delta) : f_i \in \{f_1, \ldots, f_n\}, \delta \in E_k\}$,
and denote by $\Omega_z^*$ the set of all finite words in the alphabet
$\Omega_z$, including the empty word $\lambda$. Let us extend the mapping
$\Psi$ to the set $\Omega_z^*$. Let $\alpha$ be an arbitrary word from
$\Omega_z^*$. If $\alpha = \lambda$, then $\Psi(\alpha) = 0$. For
$\alpha = (f_{i_1}, \delta_1) \ldots (f_{i_t}, \delta_t)$, $t > 0$, assume

$$\Psi(\alpha) = \sum_{j=1}^{t} \Psi(f_{i_j}) .$$

Let $\alpha \in \Omega_z^*$. Define a separable subtable $T_z\alpha$ of the
table $T_z$ in the following way. If $\alpha = \lambda$, then
$T_z\alpha = T_z$. Let $\alpha \ne \lambda$ and
$\alpha = (f_{i_1}, \delta_1) \ldots (f_{i_m}, \delta_m)$. Then $T_z\alpha$
is the subtable of the table $T_z$ that contains only the rows which have
the numbers $\delta_1, \ldots, \delta_m$ in the columns $i_1, \ldots, i_m$
respectively. We will say that a table is terminal if it contains no rows
or $\nu(x) \equiv \mathrm{const}$ on the set of rows. Denote by $S(z)$ the
set of all nonterminal separable subtables of the table $T_z$.
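
The definitions of a separable subtable and of a terminal table can be illustrated with a short Python sketch. The function names `subtable` and `is_terminal`, the 0-based column indices, and the small example table are assumptions made here for illustration only.

```python
def subtable(rows, word):
    """Rows of the separable subtable for a word given as (column, value) pairs.

    Columns are 0-based here, unlike the 1-based indices used in the text.
    The empty word returns the whole table.
    """
    return [r for r in rows if all(r[i] == delta for i, delta in word)]


def is_terminal(rows, nu):
    """A table is terminal if it has no rows or nu is constant on its rows."""
    return len({nu[r] for r in rows}) <= 1


# Hypothetical 2-valued table with three attributes; nu maps rows to decisions.
rows = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
nu = {(0, 0, 1): 0, (0, 1, 0): 1, (1, 0, 0): 1, (1, 1, 1): 2}
```

In this example, the word consisting of the single pair (0, 0) selects the first two rows, which carry different decisions, so the resulting subtable is nonterminal.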

For an arbitrary table $T$ from $S(z)$, we denote by $D(T)$ the number of
rows in $T$ and denote $N(T, P) = \sum_{\bar{\delta} \in T} P(\bar{\delta})$.

Let $\Gamma$ be a decision tree for the problem $z$. Put into
correspondence to each path $\xi = v_1, r_1, \ldots, v_t, r_t, v_{t+1}$ in
$\Gamma$ a word $\pi(\xi) \in \Omega_z^*$. Let $t \ge 1$ and, for
$j = 1, \ldots, t$, let the node $v_j$ be assigned an attribute $f_{i_j}$
and the edge $r_j$, leaving $v_j$ and entering $v_{j+1}$, be assigned a
number $\delta_j$. Then
$\pi(\xi) = (f_{i_1}, \delta_1) \ldots (f_{i_t}, \delta_t)$. We assume
$\pi(\xi) = \lambda$ for a path $\xi$ consisting of a single node.

A path from the root to a terminal node is called complete. Denote by
$\Xi(\Gamma)$ the set of complete paths in the decision tree $\Gamma$. One
can see that $\bigcup_{\xi \in \Xi(\Gamma)} T_z\pi(\xi) = T_z$, and for any
two different complete paths $\xi_1, \xi_2$, the relation
$T_z\pi(\xi_1) \cap T_z\pi(\xi_2) = \emptyset$ holds.

We will say that a decision tree $\Gamma$ solves the problem $z$ if for an
arbitrary row $\bar{\delta} \in T_z$, the terminal node of the complete path
$\xi$ such that $\bar{\delta} \in T_z\pi(\xi)$ is assigned the number
$\nu(\bar{\delta})$. In other words, for an arbitrary element $a \in A$, the
terminal node of the path on which computations for $a$ are performed is
labeled with the number $z(a)$.

1.1.5 Complexity Measures of Decision Trees 

Let $U = (A, F)$ be an information system, $\Psi$ a weight function for $U$,
and $z$ a problem over $U$. Let $\Gamma$ be a decision tree for $z$ that
solves the problem $z$. For an arbitrary row $\bar{\delta} \in T_z$, denote
by $\xi_{\bar{\delta}}$ the complete path in $\Gamma$ on which computations
for the $n$-tuple of attribute values $\bar{\delta}$ are performed.

As the main complexity measure, the average weighted depth of the decision
tree $\Gamma$ relative to the probability distribution $P$ (or, briefly, the
$P$-average weighted depth of $\Gamma$) will be used. It is defined in the
following way:

$$h_\Psi(\Gamma, P, z) = \frac{1}{N(T_z, P)} \sum_{\bar{\delta} \in T_z} \Psi(\pi(\xi_{\bar{\delta}})) P(\bar{\delta}) .$$

In addition to the average weighted depth, the weighted depth will be used
as a complexity measure of decision trees. The weighted depth of the
decision tree $\Gamma$ is defined as follows:

$$g_\Psi(\Gamma, z) = \max_{\bar{\delta} \in T_z} \Psi(\pi(\xi_{\bar{\delta}})) .$$

If $\Psi \equiv 1$, then the considered complexity measures are called the
average depth and the depth, and are denoted $h(\Gamma, P, z)$ and
$g(\Gamma, z)$. Further we will omit the symbol $z$ in the notations
$h(\Gamma, P, z)$ and $g(\Gamma, z)$ if it is clear which problem is meant.
Denote by $h_\Psi(z, P)$ and $h(z, P)$ respectively the minimum $P$-average
weighted depth and the minimum $P$-average depth of a decision tree for the
problem $z$ that solves $z$. For a weight function $\Psi$, a problem $z$ and
a probability distribution $P$, a decision tree that solves $z$ and has the
minimum $P$-average depth is called optimal for $z$ and $P$, and a tree that
solves $z$ and has the minimum $P$-average weighted depth is called optimal
for $\Psi$, $z$ and $P$.
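
As an illustration of the two complexity measures, the following Python sketch computes the average weighted depth and the weighted depth of a given tree. The tuple-based tree encoding, the function names, and the example data are assumptions introduced here; the computation is the weighted path length averaged over rows with the integer weights $P$ normalized by their sum.

```python
from fractions import Fraction


def path_weight(tree, row, weights):
    """Sum of attribute weights along the complete path followed by the row.

    A leaf is ("leaf", label); an inner node is ("node", i, children) with a
    0-based column index i and children mapping each value to a subtree.
    """
    total = 0
    while tree[0] == "node":
        _, i, children = tree
        total += weights[i]
        tree = children[row[i]]
    return total


def average_weighted_depth(tree, P, weights):
    n_total = sum(P.values())  # N(T_z, P)
    weighted = sum(path_weight(tree, d, weights) * p for d, p in P.items())
    return Fraction(weighted, n_total)


def weighted_depth(tree, P, weights):
    return max(path_weight(tree, d, weights) for d in P)


# Hypothetical example: a tree for x1 AND x2 and a uniform distribution.
tree = ("node", 0, {0: ("leaf", 0),
                    1: ("node", 1, {0: ("leaf", 0), 1: ("leaf", 1)})})
P = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1}
```

With unit weights the sketch gives the average depth and the depth; changing the weights or the values of `P` shows how the same tree is scored under a different cost model or distribution.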



1.2 Overview of Results 

This section briefly describes main theoretical results of the monograph. 

1.2.1 Bounds on Average Weighted Depth 

Let $U$ be an information system and $\Psi$ a weight function for $U$.
First, we define a parameter $M_\Psi(z)$ for a problem
$z = (\nu, f_1, \ldots, f_n)$ over $U$. If $z(x) \equiv \mathrm{const}$ on
the set $A$, then $M_\Psi(z) = 0$. Otherwise, for an arbitrary tuple
$\bar{\delta} = (\delta_1, \ldots, \delta_n) \in E_k^n$, denote by
$M_\Psi(z, \bar{\delta})$ the minimum natural number $m$ such that there
exist numbers $i_1, \ldots, i_t \in \{1, \ldots, n\}$ satisfying the
following conditions: $\Psi(f_{i_1}) + \ldots + \Psi(f_{i_t}) \le m$, and
either the set of solutions on $A$ of the system of equations
$\{f_{i_1}(x) = \delta_{i_1}, \ldots, f_{i_t}(x) = \delta_{i_t}\}$ is empty
or $z(x) \equiv \mathrm{const}$ on this set. Then

$$M_\Psi(z) = \max_{\bar{\delta} \in E_k^n} M_\Psi(z, \bar{\delta}) .$$

If $\Psi \equiv 1$, then the parameter $M_\Psi(z)$ is denoted by $M(z)$.
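
For small tables the parameter defined above can be computed by exhaustive search. The Python sketch below works on the rows of the table rather than on the set $A$, which is equivalent because rows correspond to the equivalence classes. The function names and the example problems are assumptions introduced here, and the search is exponential in the number of attributes, so it is meant only as an illustration.

```python
from itertools import combinations, product


def m_psi_tuple(rows, nu, weights, delta):
    """Minimum total weight of attributes that, fixed to the values in delta,
    leave either no rows or rows with a single common decision."""
    n = len(weights)
    best = None
    for t in range(n + 1):
        for idx in combinations(range(n), t):
            matching = [r for r in rows if all(r[i] == delta[i] for i in idx)]
            if len({nu[r] for r in matching}) <= 1:
                cost = sum(weights[i] for i in idx)
                best = cost if best is None else min(best, cost)
    return best


def m_psi(rows, nu, weights, k):
    """Maximum of m_psi_tuple over all tuples from E_k^n (0 if z is constant)."""
    if len({nu[r] for r in rows}) <= 1:
        return 0
    n = len(weights)
    return max(m_psi_tuple(rows, nu, weights, d)
               for d in product(range(k), repeat=n))


# Hypothetical examples over two Boolean attributes.
rows = list(product((0, 1), repeat=2))
nu_xor = {r: r[0] ^ r[1] for r in rows}   # both attributes are always needed
nu_first = {r: r[0] for r in rows}        # the first attribute alone suffices
```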

As a parameter of the probability distribution $P$ we will use the entropy
of the probability distribution

$$H(P) = \log_2 N(T_z, P) - \frac{1}{N(T_z, P)} \sum_{\bar{\delta} \in T_z} P(\bar{\delta}) \log_2 P(\bar{\delta}) .$$

If a diagnostic problem contains all possible attributes, then the
well-known noiseless coding theorem applies, which says that the minimum
average depth of a decision tree is between $H(P)$ and $H(P) + 1$. The
following theorem generalizes the lower bound to the case of the average
weighted depth of a decision tree for an arbitrary diagnostic problem.
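
The entropy formula is written for unnormalized integer weights $P(\bar{\delta})$. A small Python sketch (the function names are assumptions introduced here) checks that it agrees with the usual Shannon entropy of the normalized probabilities.

```python
from math import log2


def entropy(P):
    """H(P) = log2 N - (1/N) * sum of P(d) * log2 P(d), where N = sum of P(d)."""
    n_total = sum(P.values())
    return log2(n_total) - sum(p * log2(p) for p in P.values()) / n_total


def shannon_entropy(P):
    """The same quantity computed from the normalized probabilities."""
    n_total = sum(P.values())
    return -sum((p / n_total) * log2(p / n_total) for p in P.values())


# Hypothetical distribution on three rows with weights 1, 1 and 2.
P = {"d1": 1, "d2": 1, "d3": 2}
```

For this example the normalized probabilities are 1/4, 1/4 and 1/2, and both functions return 1.5.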

Theorem. (Theorem \^~^ from Sect. \2.S^) Let $U$ be a $k$-valued
information system, $\Psi$ a weight function for $U$, $z$ a diagnostic
problem over $U$, and $P$ a probability distribution for $z$. Then

$$h_\Psi(z, P) \ge \frac{H(P)}{\log_2 k} .$$



The following theorem gives an upper bound on the minimum average 
weighted depth of decision tree for an arbitrary problem. 

Theorem. (Theorem 2.3 from Sect. \2.^l) Let $U$ be an information system,
$\Psi$ a weight function for $U$, $z$ a problem over $U$, and $P$ a
probability distribution for $z$. Then

h^{z,P) <AU{z){H{P) + l) . 

Since the average depth of decision tree is a particular case of the average 
weighted depth, the above considered bounds hold for the average depth as 
well. However, the upper bound on the average depth can be improved. 

Theorem. (Theorem 2.4 from Sect. W^) Let $z$ be a problem over an
information system $U$, and $P$ a probability distribution for $z$. Then

$$h(z, P) \le \begin{cases}
M(z), & \text{if } M(z) \le 1, \\
M(z) + 2H(P), & \text{if } 2 \le M(z) \le 3, \\
M(z) + \dfrac{M(z)}{\log_2 M(z)} H(P), & \text{if } M(z) \ge 4.
\end{cases}$$

In Sect. 2.4 a possibility of reduction is considered for a problem over a
2-valued information system. An algorithm is described that constructs a
decision tree for the initial problem from decision trees for subproblems
that form a so-called proper decomposition of the original problem. The
section also contains sufficient conditions for the synthesized decision
tree to be optimal relative to the average depth.






The decomposition technique allows finding decision trees with the minimum
average depth for some classes of problems. In Sect. 2.4.3 it is used to
prove that the upper bound on the average depth of a decision tree given by
Theorem 2.4 is close to unimprovable.



Theorem. For an arbitrary natural numbers m > 2, n > 5 there exists an 
information system U^, a problem z^ over U^ with m" classes of equivalence 
and a probability distribution P^ = 1 such that 

^^'''^^- 21og,M(z) • 

This theorem immediately follows from Theorem 2.6 given in Sect. 2.4.3.

1.2.2 Representing Boolean Functions by Decision Trees



In Chap. 3 the efficiency of representing Boolean functions by decision
trees is studied. A Boolean function $f(x_1, \ldots, x_n)$ can be
represented as a problem $z = (f, x_1, \ldots, x_n)$ over the information
system $U_n = (E_2^n, \{x_1, \ldots, x_n\})$. The problem $z$ has two
equivalence classes $Q_0$ and $Q_1$ containing the sets of binary tuples on
which $f$ takes the values 0 and 1 respectively. A decision tree solving
the problem $z$ is called a decision tree implementing $f$. Denote by
$g(f)$ and $h(f)$ respectively the minimum depth of a decision tree
implementing $f$ and the minimum average depth of a decision tree
implementing $f$ relative to the probability distribution $P \equiv 1$.
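
For functions of a few variables, $g(f)$ and $h(f)$ can be found by an exhaustive recursion over subtables of the Boolean cube. The Python sketch below is an illustration written for this purpose; it is not the algorithm described later in the monograph, and the name `min_depths` and the example functions are assumptions introduced here.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product


def min_depths(f, n):
    """Return (g(f), h(f)) for a Boolean function f of n arguments,
    with h(f) taken relative to the uniform distribution P = 1."""
    rows = tuple(product((0, 1), repeat=n))

    @lru_cache(maxsize=None)
    def best(sub):
        # Returns (minimum depth, minimum sum of path lengths over rows of sub).
        if len({f(*r) for r in sub}) <= 1:
            return 0, 0  # f is constant on sub: a single terminal node suffices
        options = []
        for i in range(n):
            parts = [tuple(r for r in sub if r[i] == v) for v in (0, 1)]
            if not parts[0] or not parts[1]:
                continue  # attribute i is constant on sub and cannot split it
            results = [best(p) for p in parts]
            options.append((1 + max(d for d, _ in results),
                            len(sub) + sum(t for _, t in results)))
        # The two minima are independent and may come from different trees.
        return min(o[0] for o in options), min(o[1] for o in options)

    depth, total = best(rows)
    return depth, Fraction(total, len(rows))
```

For the conjunction of two variables the sketch gives depth 2 and average depth 3/2, whereas for the sum modulo 2 both values equal 2, since every path must check both arguments.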

Denote by $\dim f$ the number of arguments of the function $f$. Let $B$ be
a set of Boolean functions. Consider the functions

$$\mathcal{G}_B(n) = \max\{g(f) : f \in B, \dim f \le n\}$$

and

$$\mathcal{H}_B(n) = \max\{h(f) : f \in B, \dim f \le n\}$$

that characterize the worst-case growth of the minimum depth and the
minimum average depth of decision trees implementing Boolean functions from
$B$ as the number of function arguments grows. Note that
$\mathcal{H}_B(n) \le \mathcal{G}_B(n)$ for any $n$.

Section 3.1.2 contains several statements that give upper and lower bounds
on $\mathcal{H}_B(n)$ for each closed class of Boolean functions $B$. The
notation of closed classes of Boolean functions is in accordance with [36];
the classes and the class inclusion diagram are described in the Appendix.
It is shown that the function $\mathcal{H}_B(n)$ is either bounded from
above by a constant or grows linearly. The work [^F gives exact values of
$\mathcal{G}_B(n)$ for each closed class of Boolean functions. The
following two theorems characterize the relation between $\mathcal{H}_B(n)$
and $\mathcal{G}_B(n)$.

Theorem. (Theorem \3.S\ from Sect. 3.1) Let $B$ be a closed class of Boolean
functions, and $n$ a natural number. Let at least one of the following
conditions hold:

a) $n = 1$;

b) $B \in \{O_1, \ldots, O_9, L_1, \ldots, L_5, C_1, C_2, C_3\}$;

c) $B \in \{C_4, D_1, D_3\}$ and $n$ is odd;

d) $B \in \{D_1, D_2, D_3\}$ and $n = 2$.

Then $\mathcal{H}_B(n) = \mathcal{G}_B(n)$. If none of the conditions a),
b), c), d) holds, then $\mathcal{H}_B(n) < \mathcal{G}_B(n)$.

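Theorem. (Theorem 3.3 from Sect. 3.1) Let $B$ be a closed class of Boolean
functions. Then

a) $\lim_{n \to \infty} \mathcal{H}_B(n) / \mathcal{G}_B(n) = 0$ if
$B \in \{S_1, S_3, S_5, S_6, P_1, P_3, P_5, P_6\}$;

b) $\mathcal{H}_B(n) / \mathcal{G}_B(n) = 1$ if
$B \in \{O_1, \ldots, O_9, L_1, \ldots, L_5, C_1, C_2, C_3\}$;

c) $\lim_{n \to \infty} \mathcal{H}_B(n) / \mathcal{G}_B(n) = 1$ if
$B \in \{C_4, M_1, \ldots, M_4, D_1, D_2, D_3\}$;

d) $\lim_{n \to \infty} \mathcal{H}_B(n) / \mathcal{G}_B(n) = 1/2$ if
$B \in \{F_1^\infty, \ldots, F_8^\infty\}$;

e) $1/2 - \varepsilon(n) < \mathcal{H}_B(n) / \mathcal{G}_B(n) < 1$, where
$\varepsilon(n) = O(1/\sqrt{n})$, if $B \in \{F_1^\mu, \ldots, F_8^\mu\}$
and $\mu \ge 2$.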

If the number of nodes is estimated in addition to the average weighted
depth, it is reasonable to merge isomorphic subtrees in a decision tree.
The resulting object is called a branching program. A branching program is
called read-once if on any path from the root to a terminal node each
attribute occurs at most once. The following theorem shows that the
requirement for a branching program to be read-once is rather strong, as
any branching program with the minimum average weighted depth is a
read-once branching program.
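
The read-once property can be checked directly from its definition. In the Python sketch below a branching program is encoded as a dictionary of nodes; this encoding, the function name `is_read_once`, and the two example programs are assumptions introduced here. The check enumerates paths, so it is suitable only for small programs.

```python
def is_read_once(program, root):
    """True if no path from the root to a terminal node tests an attribute twice.

    program maps a node id to ("leaf", label) or ("node", attribute, edges),
    where edges maps each attribute value to the id of the next node.
    """
    def visit(node_id, seen):
        node = program[node_id]
        if node[0] == "leaf":
            return True
        _, attribute, edges = node
        if attribute in seen:
            return False
        return all(visit(child, seen | {attribute}) for child in edges.values())

    return visit(root, frozenset())


# Hypothetical program for x1 XOR x2 XOR x3 with shared nodes on each level.
parity = {
    "a": ("node", "x1", {0: "b0", 1: "b1"}),
    "b0": ("node", "x2", {0: "c0", 1: "c1"}),
    "b1": ("node", "x2", {0: "c1", 1: "c0"}),
    "c0": ("node", "x3", {0: "zero", 1: "one"}),
    "c1": ("node", "x3", {0: "one", 1: "zero"}),
    "zero": ("leaf", 0),
    "one": ("leaf", 1),
}

# Hypothetical program that tests x1 twice on one path.
repeated = {
    "a": ("node", "x1", {0: "b", 1: "one"}),
    "b": ("node", "x1", {0: "zero", 1: "one"}),
    "zero": ("leaf", 0),
    "one": ("leaf", 1),
}
```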

Theorem. (Theorem 3.4 from Chapter 3) Let U be a 2-valued information
system, $\Psi$ a weight function for U, z a problem over U, and P a probability
distribution for z. Let G be a branching program for z that solves z and is
optimal for $\Psi$, z and P. Then G is a read-once branching program.
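
The read-once property defined above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding that is not taken from the book: a branching program is a dictionary mapping node names either to an integer (terminal node) or to a pair (attribute index, successors by edge label). The check enumerates paths, so it is only meant for small examples.

```python
from typing import Dict, Tuple, Union

# Hypothetical encoding: a terminal node is an int (its label); a nonterminal
# node is (attribute index, {edge label: name of the successor node}).
Node = Union[int, Tuple[int, Dict[int, str]]]


def is_read_once(program: Dict[str, Node], root: str) -> bool:
    """Return True if no root-to-terminal path tests an attribute twice."""

    def visit(name: str, seen: frozenset) -> bool:
        node = program[name]
        if isinstance(node, int):      # terminal node: the path ends here
            return True
        attr, successors = node
        if attr in seen:               # attribute repeated on this path
            return False
        return all(visit(nxt, seen | {attr}) for nxt in successors.values())

    return visit(root, frozenset())
```

For instance, a program for the sum modulo 2 of two attributes that shares its two terminal nodes passes the check, while a program that queries the same attribute twice on some path does not.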

In [66], it is shown that a read-once branching program implementing the
function $\mathrm{Mult} : \{0,1\}^{2n} \to \{0,1\}$ (the middle bit of the product of two
n-bit integers) contains at least $2^{\Omega(\sqrt{n})}$ nodes. In [83, 84, 85], the Boolean
function n/2-Clique-Only is considered, which receives the adjacency
matrix of a graph with n nodes and takes the value 1 if and only if the graph
contains an n/2-clique and does not contain any other edges. It is shown that a
read-once branching program implementing the function n/2-Clique-Only
contains at least $2^{\Omega(n)}$ nodes, while there is a branching program with $O(n^3)$
nodes implementing n/2-Clique-Only such that any attribute appears
at most twice in each path. In [59], it is shown that a branching program
implementing the characteristic functions of Bose-Chaudhuri codes contains
at least $\exp(\Omega(\sqrt{n}/2))$ nodes.

Theorem 3.4 shows that the branching programs that are optimal relative
to the average weighted depth have at least as many nodes as
the read-once branching programs with the minimum number of nodes.

1.2.3 Algorithms for Decision Tree Construction 

Let $U = (A, F)$ be an information system and $z = (\nu, f_1, \ldots, f_n)$ a problem
over U. Let T be a separable subtable of $T_z$. For $i \in \{1, \ldots, n\}$, denote by
$E(T, i)$ the set of numbers contained in the i-th column of the table T, and
denote $E(T) = \{i : i \in \{1, \ldots, n\}, |E(T, i)| \ge 2\}$.

Among decision trees for the problem z that solve z we distinguish
irredundant decision trees. These are the trees $\Gamma$ in which every node w
satisfies the following conditions. Denote by $path(\Gamma, w)$ the path from the
root to w. Let $T = T_z\pi(path(\Gamma, w))$ be a terminal subtable and $\nu(x) \equiv r$ on
the set of rows of the table T for some $r \in \omega$. Then w is a terminal node
labeled with r. Let T be a nonterminal subtable. Then w is labeled with an
attribute $f_i$ where $i \in E(T)$. Finally, each node w such that
$T_z\pi(path(\Gamma, w)) = \emptyset$ is labeled with the number 0.
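
The set E(T) is easy to compute for a concrete table. The sketch below assumes a hypothetical representation in which a subtable is a list of equal-length tuples (its rows); the function name is illustrative and does not come from the book.

```python
def informative_columns(table):
    """Return E(T): the 1-based indices of the columns of the table that
    contain at least two different numbers."""
    if not table:
        return set()
    n = len(table[0])
    return {i + 1 for i in range(n) if len({row[i] for row in table}) >= 2}
```

In an irredundant tree only attributes with indices from this set may label a node that corresponds to a nonterminal subtable, since any other attribute takes a single value on all rows and cannot split the subtable.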

The following proposition shows that among irredundant decision trees, 
at least one has the minimum average depth. 

Proposition. (Proposition 4.1 from Sect. 4.1) Let U be an information
system, $\Psi$ a weight function for U, z a problem over U, and P a probability
distribution for z. Then there exists an irredundant decision tree that is
optimal for $\Psi$, z and P.

Denote by $Tree(T_z)$ the set of irredundant decision trees for the problem z. In
Sect. 4.1, an algorithm $\mathcal{A}$ is described that constructs the set of optimal
irredundant decision trees for the problem z. At the first stage of the algorithm,
a graph $\Delta(z)$ of separable subtables of the table $T_z$ is constructed. The graph
in some sense describes all irredundant decision trees for the problem z. Then
the algorithm reduces the graph $\Delta(z)$, resulting in the graph $\Delta_{\Psi,P}(z)$. The
following theorem is a direct consequence of Proposition 4.2 and Theorem 4.1.
It characterizes the set of trees described by the graph $\Delta_{\Psi,P}(z)$.

Theorem. Let U be an information system, $\Psi$ a weight function, z a problem
over U, and P a probability distribution for z. Then the algorithm $\mathcal{A}$, given
the extended table $T^*$, builds a graph $\Delta_{\Psi,P}(z)$ that describes the set of all
irredundant decision trees that are optimal relative to the average weighted
depth.

For an arbitrary polynomial Q, a probability distribution P is called
Q-restricted if for an arbitrary row $\bar{\delta} \in T_z$, the length of the binary notation of
the number $P(\bar{\delta})$ does not exceed $Q(n)$ where n is the number of columns in
the table. One more theorem formulated in Sect. 4.1 characterizes the time
complexity of the algorithm $\mathcal{A}$.

Theorem. (Theorem 4.2 from Sect. 4.1) Let $Q(x)$ be some polynomial. Then
for an arbitrary problem $z = (\nu, f_1, \ldots, f_n)$ and an arbitrary Q-restricted
probability distribution P for the problem z, the working time of the algorithm
$\mathcal{A}$ is proportional to the number of rows $D(T_z)$ if the table $T^*$ is terminal. If
the table $T^*$ is nonterminal, the working time of the algorithm $\mathcal{A}$ is bounded
from below by the maximum of the values n, the number of nonterminal
separable subtables $|S(z)|$, $D(T_z)$ and the maximum length of an attribute weight in
binary notation, and is bounded from above by a polynomial in these values.



1.2.4 Restricted Information Systems 

Chapter 5 distinguishes, among all information systems, the so-called
restricted information systems. The property of being restricted implies a
common upper bound on the minimum average weighted depth of decision trees
that depends only on the entropy of the probability distribution and holds for all
problems over the information system. Another property of restricted
information systems is that, under reasonable assumptions about the weight function
and the probability distribution, the working time of the algorithm $\mathcal{A}$ is bounded
from above by a polynomial in the number of attributes in the problem
description.

For an arbitrary natural number t, a system of equations of the form
$\{f_1(x) = \delta_1, \ldots, f_t(x) = \delta_t\}$ where $f_1, \ldots, f_t \in F$ and $\delta_1, \ldots, \delta_t \in E_k$, is
called a system of equations over U. An information system U is called
r-restricted (restricted) if each compatible system of equations over U has an
equivalent subsystem that contains at most r equations.

For a system of equations $\{f_1(x) = \delta_1, \ldots, f_t(x) = \delta_t\}$ over the information
system U, the value $\sum_{i=1}^{t} \Psi(f_i)$ is called the weight of the system of equations.

An information system U is called r-restricted (restricted) relative to $\Psi$
if each compatible system of equations over U has an equivalent subsystem
whose weight does not exceed r.

Example. (Example 5.1 from Chapter 5) Let $A = \mathbb{R}^n$, and F be a nonempty
set of mappings from $\mathbb{R}^n$ to $\mathbb{R}$. Consider an infinite family of functions $[F] =
\{\mathrm{sign}(f + \alpha) + 1 : f \in F, \alpha \in \mathbb{R}\}$ (note that the expression $\mathrm{sign}(x) + 1$
takes the value 0 for a negative x, 1 for $x = 0$, and 2 for a positive x). If
$|F| = k < \infty$, then the information system $U = (A, [F])$ is 2k-restricted (or
2k-restricted relative to the weight function $\Psi \equiv 1$).
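
The bound 2k in the example can be made concrete: for a fixed function f, every equation sign(f(x) + alpha) + 1 = delta is a strict bound or an equality on the value f(x), so the tightest lower bound and the tightest upper bound (or a single equality) already determine the solution set. The sketch below illustrates this under an assumed, hypothetical encoding: an equation is a triple (index of f, alpha, delta), and the input system is assumed to be compatible.

```python
def sign_plus_one(y):
    """Value of sign(y) + 1: 0 for negative y, 1 for y = 0, 2 for positive y."""
    return 0 if y < 0 else (1 if y == 0 else 2)


def satisfies(values, equations):
    """values[f] is the value of the f-th function at some fixed point x."""
    return all(sign_plus_one(values[f] + alpha) == delta
               for f, alpha, delta in equations)


def reduce_system(equations):
    """Select at most two equations per function that have the same solutions.
    Assumes the system is compatible (has at least one solution)."""
    by_function = {}
    for eq in equations:
        by_function.setdefault(eq[0], []).append(eq)
    reduced = []
    for eqs in by_function.values():
        equalities = [e for e in eqs if e[2] == 1]      # f(x) = -alpha
        if equalities:
            reduced.append(equalities[0])               # fixes f(x) completely
            continue
        lower = [e for e in eqs if e[2] == 2]           # f(x) > -alpha
        upper = [e for e in eqs if e[2] == 0]           # f(x) < -alpha
        if lower:
            reduced.append(min(lower, key=lambda e: e[1]))  # largest bound -alpha
        if upper:
            reduced.append(max(upper, key=lambda e: e[1]))  # smallest bound -alpha
    return reduced
```

With k underlying functions the reduced system has at most 2k equations, which matches the statement of the example.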

The following theorem for an arbitrary problem over a restricted information 
system and an arbitrary probability distribution, gives an upper bound on 
the minimum average weighted depth of decision tree that depends only on 
the entropy of probability distribution. 

Theorem. (Theorem 5.1 from Sect. 5.1) Let U be an information system, $\Psi$
a weight function for U, and U be r-restricted relative to $\Psi$ where r is some
natural number. Then $h_\Psi(z, P) \le 2r(H(P) + 1)$ for an arbitrary problem z
over U and an arbitrary probability distribution P for z.

The following theorem shows that the conditions of Theorem 5.1 are
necessary and sufficient for the existence of a linear upper bound depending only on
the entropy, and that considering non-linear bounds does not extend the class of
information systems that have upper bounds depending only on the entropy.

Theorem. (Theorem 5.2 from Sect. 5.1) Let U be an information system
that is not restricted relative to the weight function $\Psi$ for U. Then for an
arbitrary $\varepsilon > 0$, there is no function $\Phi$ that is bounded on the interval $[0, \varepsilon]$
and satisfies the condition $h_\Psi(z, P) \le \Phi(H(P))$ for any problem z over U
and any probability distribution P for z.

Denote by $Z(U)$ the set of problems over the information system U. For an
arbitrary problem z, denote by $\dim z$ the number of attributes listed in the
description of z.

Consider the functions

$$\mathcal{S}_U(n) = \max\{|S(z)| : z \in Z(U), \dim z \le n\}$$

and

$$\mathcal{D}_U(n) = \max\{D(T_z) : z \in Z(U), \dim z \le n\}$$

that characterize the dependence of the maximum number of separable
subtables and the maximum number of rows on the number of columns in decision
tables over U.

Let $\Psi$ be bounded from above by some constant, and $Q(x)$ be some
polynomial. Theorem 4.2 implies that for an arbitrary problem z over U and an
arbitrary Q-restricted probability distribution for the problem z, the time
complexity of the algorithm $\mathcal{A}$ is bounded from above by a polynomial in
the number of attributes in the problem description if the functions $\mathcal{S}_U(n)$
and $\mathcal{D}_U(n)$ are bounded from above by a polynomial in n. Also, one can see
that the time complexity of the algorithm $\mathcal{A}$ has an exponential lower bound
if the function $\mathcal{S}_U(n)$ grows exponentially.

Theorem. (Theorem 5.5 from Sect. 5.2) Let $U = (A, F)$ be a k-valued
information system. Then the following statements hold:

a) if U is r-restricted, then $\mathcal{S}_U(n) \le (nk)^r + 1$ and $\mathcal{D}_U(n) \le (nk)^r + 1$ for
any natural number n;

b) if U is not restricted, then $\mathcal{S}_U(n) \ge 2^n - 1$ for any natural number n.



Chapter 2 

Bounds on Average Time Complexity of 

Decision Trees 



In this chapter, bounds on the average depth and the average weighted depth
of decision trees are considered. Similar problems are studied in search theory
[1], coding theory [77], and the design and analysis of algorithms (e.g., sorting) [55].
For any diagnostic problem, the minimum average depth of a decision tree is
bounded from below by the entropy of the probability distribution (with a
multiplier $1/\log_2 k$ for a problem over a k-valued information system). Among
diagnostic problems, the problems with a complete set of attributes have the
lowest minimum average depth of decision trees (e.g., the problem of building
an optimal prefix code [1] and a blood test study under the assumption that exactly
one patient is ill [23]). For such problems, the minimum average depth of a
decision tree exceeds the lower bound by at most one. The minimum
average depth reaches its maximum on the problems in which each attribute
is "indispensable" 01] (e.g., a diagnostic problem with n attributes and $k^n$
pairwise different rows in the decision table, and the problem of implementing
the modulo 2 summation function). For these problems, the minimum
average depth of a decision tree is equal to the number of attributes in the problem
description.

We also consider the possibility of problem decomposition. Some problems
have a hierarchy of attributes: "basic" attributes perform a rough
classification, and "extended" ones can be applied to refine the solution. In this case,
the leaf composition [53] can be applied: a tree for rough classification is built
using basic attributes only, and then each leaf is replaced with a tree that
performs fine classification using extended attributes only. We are interested in
finding conditions under which the tree resulting from such a composition
has the minimum average time complexity. In this case, applying problem
decomposition leads to a solution that is both comprehensible and efficient.

The chapter consists of four sections. The first section gives a known bound
for a diagnostic problem with a complete set of attributes. The second section
generalizes the known lower bound and gives an upper bound for the average
weighted depth, which depends on the parameter $M_\Psi(z)$ and the entropy of
the probability distribution. The third section gives a more precise upper bound
for the minimum average depth of a decision tree. The fourth section describes
sufficient conditions for problem decomposition which allow synthesizing an
optimal tree for the initial problem from optimal trees for subproblems. An
example of a decomposable problem is considered that has the minimum
average depth of a decision tree close to the upper bound given in Sect. 2.3. The
results of this chapter were previously published in [13 [13 [SD [Sll US IM] • 



2.1 Known Bounds 

A problem $z = (\nu, f_1, \ldots, f_n)$ with s equivalence classes $Q_1, \ldots, Q_s$ over a
k-valued information system $U = (A, F)$ contains a complete set of attributes
if for an arbitrary partition $\{1, \ldots, s\} = \bigcup_{j \in E_k} I_j$ (where $I_i \cap I_j = \emptyset$ if $i \ne j$),
there exists an attribute $f_t \in \{f_1, \ldots, f_n\}$ such that

$$\bigcup_{i \in I_j} Q_i \subseteq \{a \in A : f_t(a) = j\}$$

for each $j \in E_k$.

The following theorem gives a bound on the average depth of a decision
tree for a diagnostic problem that contains a complete set of attributes. The
bound follows from coding theory results and is well known in search theory
(see, for example, [1]).

Theorem 2.1. Let z be a diagnostic problem with a complete set of attributes
over a k-valued information system U, and P a probability distribution for
z. Then

$$\frac{H(P)}{\log_2 k} \le h(z, P) \le \frac{H(P)}{\log_2 k} + 1 .$$
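
The chapter introduction names the construction of an optimal prefix code as an example of a problem with a complete set of attributes. For k = 2 the bound can therefore be illustrated with binary Huffman coding, where the codeword length of a symbol plays the role of the depth of the corresponding terminal node. The sketch below is an illustration with made-up probabilities, not a construction taken from the book.

```python
import heapq
from math import log2


def huffman_lengths(probs):
    """Return the codeword length of each symbol in a binary Huffman code."""
    if len(probs) == 1:
        return [0]
    # Heap items: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:              # every merged symbol moves one level down
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


def entropy(probs):
    """Shannon entropy (in bits) of a normalized distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


probs = [0.4, 0.3, 0.2, 0.1]
lengths = huffman_lengths(probs)
average_depth = sum(p * l for p, l in zip(probs, lengths))
```

For this distribution the average codeword length lies between the entropy and the entropy plus one, as the theorem states for k = 2.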

2.2 Bounds on Average Weighted Depth 

The following theorem generalizes the lower bound to the case of the average 
weighted depth of decision tree for an arbitrary diagnostic problem. 

Theorem 2.2. Let U be a k-valued information system, $\Psi$ a weight function
for U, z a diagnostic problem over U, and P a probability distribution for z.
Then

$$h_\Psi(z, P) \ge \frac{H(P)}{\log_2 k} .$$

Proof. Let $U = (A, F)$, $z = (\nu, f_1, \ldots, f_n)$, and let the problem z contain s
equivalence classes $Q_1, \ldots, Q_s$. Let $(I_0^1, \ldots, I_{k-1}^1), \ldots, (I_0^r, \ldots, I_{k-1}^r)$ be all
possible partitions of the set $\{1, \ldots, s\}$ possessing the following conditions:
$\bigcup_{j=0}^{k-1} I_j^t = \{1, \ldots, s\}$, and for any numbers $i, j \in E_k$, $i \ne j$, for $t = 1, \ldots, r$,
the relation $I_i^t \cap I_j^t = \emptyset$ holds. Define an attribute $g_t : A \to E_k$, $t = 1, \ldots, r$,
as follows. If $a \in Q_i$ and $i \in I_j^t$, then $g_t(a) = j$. Consider the problem
$z' = (\nu', f_1, \ldots, f_n, g_1, \ldots, g_r)$ over the information system $U' = (A, F \cup
\{g_1, \ldots, g_r\})$ where $\nu' : E_k^{n+r} \to \omega$, and $\nu'(\delta_1, \ldots, \delta_n, \delta_{n+1}, \ldots, \delta_{n+r}) =
\nu(\delta_1, \ldots, \delta_n)$ for each $(\delta_1, \ldots, \delta_{n+r}) \in E_k^{n+r}$. According to the definition of
the attributes $g_1, \ldots, g_r$, we have that the problem $z'$ contains a complete set
of attributes, and $z'$ has the same equivalence classes as the problem z.
Evidently, $z'(a) = z(a)$ for any element $a \in A$. Then $z'$ is a diagnostic problem
and Theorem 2.1 implies

$$h(z', P) \ge \frac{H(P)}{\log_2 k} . \tag{2.1}$$

Let $\Gamma$ be a decision tree for the problem z that solves z. One can see that $\Gamma$
is a decision tree for the problem $z'$ that solves $z'$. Then

$$h(z, P) \ge h(z', P) . \tag{2.2}$$

Since $\Psi(f) \ge 1$ for an arbitrary attribute $f \in F$, the relation $h_\Psi(z, P) \ge
h(z, P)$ holds. The last inequality, (2.1) and (2.2) imply the bound given by
the theorem statement. □

The following theorem gives an upper bound on the minimum average 
weighted depth of decision tree for an arbitrary problem. 

Theorem 2.3. Let U be an information system, $\Psi$ a weight function for U,
z a problem over U, and P a probability distribution for z. Then

$$h_\Psi(z, P) \le M_\Psi(z) H(P) + M_\Psi(z) .$$

The following proposition shows that the additive constant $M_\Psi(z)$ in the
right-hand side of the inequality is essential.

Proposition 2.1. For an arbitrary $m \in \omega \setminus \{0\}$, there exist a 2-valued
information system U, a weight function $\Psi$ for U, a problem z over U and a
sequence of probability distributions $P_1, P_2, \ldots$ for z, such that $M_\Psi(z) = m$,
$\lim_{i \to \infty} H(P_i) = 0$, and $\lim_{i \to \infty} h_\Psi(z, P_i) = m$.

Proof. Let $m \in \omega \setminus \{0\}$. Define a 2-valued information system U as follows:
$U = (A, F)$ where $A = \{0, 1, \ldots, m\}$, $F = \{f_1, \ldots, f_m\}$ and

$$f_i(a) = \begin{cases} 0, & \text{if } i \ne a, \\ 1, & \text{if } i = a, \end{cases}$$

for any $f_i \in F$ and $a \in A$. Assume that $\Psi(f_i) = 1$ for $i = 1, \ldots, m$. Let
$z = (\nu, f_1, \ldots, f_m)$ be a diagnostic problem. One can see that z has $(m + 1)$
equivalence classes $Q_0 = \{0\}, Q_1 = \{1\}, \ldots, Q_m = \{m\}$, the table $T_z$ contains
$(m + 1)$ rows and it is not a terminal table. Consider a probability distribution
$P_i$ for z, defined as follows:

$$P_i(\bar{\delta}) = \begin{cases} i, & \text{if } \bar{\delta} = (0, 0, \ldots, 0), \\ 1, & \text{if } \bar{\delta} \in T_z \setminus \{(0, 0, \ldots, 0)\} . \end{cases}$$

One can see that $\lim_{i \to \infty} H(P_i) = 0$. Let $\bar{\delta} \in E_2^m$. It is easy to show that
$M_\Psi(z, \bar{\delta}) = 1$ for $\bar{\delta} \ne (0, \ldots, 0)$, and $M_\Psi(z, \bar{\delta}) = m$ for $\bar{\delta} = (0, \ldots, 0)$.
Consequently, $M_\Psi(z) = m$.

Let $i \in \omega \setminus \{0\}$, and let $\Gamma$ be a decision tree for the problem z that solves z
and has $h_\Psi(\Gamma, P_i) = h_\Psi(z, P_i)$. Consider a complete path $\xi$ in $\Gamma$ such that
$(0, \ldots, 0) \in T_z\pi(\xi)$. One can see that the length of the path $\xi$ is at least
m. Consequently, $h_\Psi(\Gamma, P_i) \ge mi/(i + m)$, and $h_\Psi(z, P_i) \ge mi/(i + m)$.
Theorem 2.3 implies $h_\Psi(z, P_i) \le M_\Psi(z)(H(P_i) + 1) = m(H(P_i) + 1)$. Using
these relations, we have that $\lim_{i \to \infty} h_\Psi(z, P_i) = m$. □
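
The squeeze used at the end of the proof can be observed numerically. The sketch below is only an illustration: it assumes that the entropy is computed from unnormalized row weights as the sum of (w/N) log2(N/w), which agrees with the way H(P) arises in the proof of Theorem 2.3, and it evaluates the lower bound mi/(i + m) and the upper bound m(H(P_i) + 1) for growing i.

```python
from math import log2


def entropy(weights):
    """H(P) for a distribution given by positive row weights with total N."""
    total = sum(weights)
    return sum(w / total * log2(total / w) for w in weights)


def bounds(m, i):
    """Lower and upper bounds on the minimum average weighted depth for P_i."""
    weights = [i] + [1] * m    # the row (0,...,0) has weight i, the other m rows weight 1
    lower = m * i / (i + m)
    upper = m * (entropy(weights) + 1)
    return lower, upper
```

As i grows, both values approach m, so the additive term in Theorem 2.3 cannot be dropped for this family of distributions.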



2.3 Upper Bound on Average Depth 

Since the average depth of decision tree is a particular case of the average 
weighted depth, the above considered upper and lower bounds hold for the 
average depth as well. However, the upper bound on the average depth can 
be improved. 

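Theorem 2.4. Let z be a problem over an information system U, and P a
probability distribution for z. Then

$$h(z, P) \le \begin{cases} M(z), & \text{if } M(z) \le 1, \\ M(z) + 2H(P), & \text{if } 2 \le M(z) \le 3, \\ M(z) + \dfrac{M(z)}{\log_2 M(z)} H(P), & \text{if } M(z) \ge 4 . \end{cases}$$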
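
To see how much Theorem 2.4 improves on Theorem 2.3 in the case of unit weights, the two bounds can be evaluated side by side. The helper names below are illustrative and the inputs are arbitrary sample values of M(z) and H(P).

```python
from math import log2


def average_depth_bound(M, H):
    """Upper bound of Theorem 2.4 on h(z, P) for M(z) = M and H(P) = H."""
    if M <= 1:
        return M
    if M <= 3:
        return M + 2 * H
    return M + M / log2(M) * H


def weighted_depth_bound(M, H):
    """Upper bound of Theorem 2.3 with unit weights: M(z) * (H(P) + 1)."""
    return M * (H + 1)
```

For example, with M(z) = 8 and H(P) = 3 the first function gives 16, while the second gives 32; the gap widens as M(z) grows because the multiplier of the entropy is M(z)/log2 M(z) rather than M(z).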

Theorem 2.6 in Sect. 2.4.3 characterizes the quality of the obtained bound.

2.3.1 Process of Building Decision Trees $Y_{U,\Psi}$

Let $U = (A, F)$ be a k-valued information system, $\Psi$ a weight function for
U, $z = (\nu, f_1, \ldots, f_n)$ a problem over U, and P a probability distribution for
z. In this section, a process $Y_{U,\Psi}$ is considered that takes as input z and P,
and builds a decision tree $Y_{U,\Psi}(z, P)$ that solves the problem z. The bounds
given by Theorem 2.3 and Theorem 2.4 result from the analysis of decision
trees built by this process.

The set F can be uncountable and the function $\Psi$ can be incomputable,
so in the general case, the process $Y_{U,\Psi}$ is a way of defining the decision tree
$Y_{U,\Psi}(z, P)$ rather than an algorithm.

The process $Y_{U,\Psi}$ includes a subprocess $X_\Psi$ that builds a decision tree
$X_\Psi(z, P, T)$ for given z, P and an arbitrary nonterminal subtable T of the
table $T_z$.

Define a mapping $num_z : \Omega_z^* \to \omega$. For $j = 1, 2, \ldots$, denote by $r_j$ the
j-th prime number. Let $\beta \in \Omega_z^*$. If $\beta = \lambda$, then $num_z(\beta) = 1$. Let $\beta \ne \lambda$
and $\beta = (f_{i_1}, \delta_1) \ldots (f_{i_t}, \delta_t)$. Then $num_z(\beta) = r_1^{i_1} \times \ldots \times r_t^{i_t}$. The number
$num_z(\beta)$ will be called the z-number of the word $\beta$.

For an arbitrary word $\alpha \in \Omega_z^*$, denote by $h(\alpha)$ the length of the word $\alpha$ and
by $\chi(\alpha)$ the set of letters from the alphabet $\Omega_z$ that are contained in $\alpha$.

Description of the subprocess $X_\Psi$

Let the subprocess $X_\Psi$ be applied to the triplet z, P, T, where T is a
nonterminal subtable of the table $T_z$.

Step 1. For each $i \in \{1, \ldots, n\}$, assume $\sigma_i$ to be the minimum number $\sigma$
from $E_k$ for which

$$N(T(f_i, \sigma), P) = \max\{N(T(f_i, \delta), P) : \delta \in E_k\} .$$

Denote $\bar{\sigma} = (\sigma_1, \ldots, \sigma_n)$. Let $\beta$ be the word with the minimum z-number
among all words in $\Omega_z^*$ possessing the following conditions: $\chi(\beta) \subseteq \{(f_1, \sigma_1),
\ldots, (f_n, \sigma_n)\}$, the subtable $T\beta$ is terminal, and $\Psi(\beta) = M_\Psi(z, \bar{\sigma})$. Note that
$\beta \ne \lambda$, because the subtable T is nonterminal. Let $\beta = (f_{i_1}, \sigma_{i_1}) \ldots (f_{i_m}, \sigma_{i_m})$.
Denote $I_1 = \{f_{i_1}, \ldots, f_{i_m}\}$. Build a tree that consists of a single node. Assign
the word $\lambda$ to this node. Denote by $G_1$ the obtained tree. Proceed to step 2.

Let $t \ge 1$ steps have already been done and a tree $G_t$ and a set $I_t$ have
been built.

Step (t + 1). Find in the tree $G_t$ the only node w that is assigned a
word from $\Omega_z^*$. Denote by $\alpha$ the word assigned to w.

If $I_t = \emptyset$, then assign to w the number 0 instead of the word $\alpha$. Denote
the resulting tree by $X_\Psi(z, P, T)$. The subprocess $X_\Psi$ is completed.

Let $I_t \ne \emptyset$. Let j be the minimum number from the set $\{1, \ldots, n\}$
possessing the following conditions: $f_j \in I_t$ and

$$\max\{N(T\alpha(f_j, \sigma), P) : \sigma \in E_k \setminus \{\sigma_j\}\} \ge \max\{N(T\alpha(f_i, \sigma), P) : \sigma \in E_k \setminus \{\sigma_i\}\}$$

for any $f_i \in I_t$. Assign the attribute $f_j$ to the node w instead of the word $\alpha$.
For each $\sigma \in E_k$, add to the tree $G_t$ a node $w_\sigma$ and the edge that leaves the
node w and enters $w_\sigma$. Assign the number $\sigma$ to that edge. Label the node $w_\sigma$
with the word $\alpha(f_j, \sigma_j)$ if $\sigma = \sigma_j$, or with the number 0 otherwise. Denote by
$G_{t+1}$ the resulting tree. Assume $I_{t+1} = I_t \setminus \{f_j\}$. Proceed to step (t + 2).
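
The choice of the tuple of most probable values made in Step 1 of the subprocess can be sketched as follows. The representation is a hypothetical one chosen for illustration: a subtable is a list of rows (tuples over {0, ..., k-1}), and the distribution P is a dictionary that maps each row to its weight.

```python
def majority_values(rows, weights, k):
    """For each column i, return the minimum value a in {0, ..., k-1} that
    maximizes N(T(f_i, a), P), the total weight of the rows with a in column i."""
    n = len(rows[0])
    sigma = []
    for i in range(n):
        totals = [0] * k
        for row in rows:
            totals[row[i]] += weights[row]
        # list.index returns the first position of the maximum, i.e. the
        # minimum value among all maximizers, as Step 1 requires.
        sigma.append(totals.index(max(totals)))
    return tuple(sigma)
```

Only this first part of Step 1 is covered; finding the word with the minimum z-number depends on the weight function and on a search over words, which the book treats as a definition rather than an algorithm.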

Description of the process $Y_{U,\Psi}$

Let the process $Y_{U,\Psi}$ be applied to the pair (z, P).

Step 1. Assume $T = T_z$. Build a decision tree that consists of a single node
v.

Let T be a terminal table. Then assign the number $\nu(\bar{\delta})$ to the node v
where $\bar{\delta}$ is an arbitrary row from T. Denote by $Y_{U,\Psi}(z, P)$ the resulting decision
tree. The process $Y_{U,\Psi}$ is completed.

Let T be a nonterminal table. Assign the word $\lambda$ to the node v and proceed
to the next step.

Let $t \ge 1$ steps have already been done. Denote by G the tree built at
step t.

Step (t + 1). If no node in G is assigned a word from $\Omega_z^*$, then denote by
$Y_{U,\Psi}(z, P)$ the tree G. The process $Y_{U,\Psi}$ is completed. Otherwise, choose in
G a terminal node v which is assigned a word from $\Omega_z^*$. Denote by $\alpha$ the
word assigned to v.

Let $T\alpha$ be a terminal subtable. If $T\alpha = \emptyset$, then assign to v the number 0
instead of the word $\alpha$. If $T\alpha \ne \emptyset$, then assign to v the number $\nu(\bar{\delta})$ instead
of $\alpha$ where $\bar{\delta}$ is an arbitrary row from $T\alpha$. Proceed to step (t + 2).

Let $T\alpha$ be a nonterminal subtable. Apply the subprocess $X_\Psi$ to build the
decision tree $X_\Psi(z, P, T\alpha)$. For each complete path $\xi$ in $X_\Psi(z, P, T\alpha)$, replace
the number 0 assigned to its terminal node with the word $\alpha\pi(\xi)$. Denote by $\Gamma$
the tree resulting from this replacement. Replace in G the node v with the
tree $\Gamma$. Proceed to step (t + 2).



2.3.2 Proofs of Theorems 2.3 and 2.4

This section contains proofs of the upper bounds on the minimum average
time complexity of decision trees given in Sect. 2.2 and Sect. 2.3.

Lemma 2.1. Let $U = (A, F)$ be a k-valued information system, $\Psi$ a weight
function for U, $z = (\nu, f_1, \ldots, f_n)$ a problem over U, P a probability
distribution for z, and T a nonterminal subtable of the table $T_z$. Then the following
conditions hold for each complete path $\xi$ in the decision tree $X_\Psi(z, P, T)$:

a) $\Psi(\pi(\xi)) \le M_\Psi(z)$;

b) if $T\pi(\xi)$ is a nonterminal subtable, then

$$N(T\pi(\xi), P) \le N(T, P)/\max\{2, h(\pi(\xi))\} .$$

Proof. For each $i \in \{1, \ldots, n\}$, denote by $\sigma_i$ the minimum number from $E_k$ such
that

$$N(T(f_i, \sigma_i), P) = \max\{N(T(f_i, \sigma), P) : \sigma \in E_k\} .$$

Denote $\bar{\sigma} = (\sigma_1, \ldots, \sigma_n)$. Denote by $\beta$ a word from $\Omega_z^*$ with the minimum
z-number possessing the following conditions: $\chi(\beta) \subseteq \{(f_1, \sigma_1), \ldots, (f_n, \sigma_n)\}$,
$T\beta$ is a terminal table, and $\Psi(\beta) = M_\Psi(z, \bar{\sigma})$. Obviously, all letters in the
word $\beta$ are pairwise different. Using this property of the word $\beta$ and the
description of the subprocess $X_\Psi$, one can show that there exists a complete
path $\xi_0$ in the tree $X_\Psi(z, P, T)$ such that $\chi(\pi(\xi_0)) = \chi(\beta)$, and the words
$\pi(\xi_0)$ and $\beta$ are of equal length. Then $T\pi(\xi_0)$ is a terminal subtable, and
$\Psi(\pi(\xi_0)) = \Psi(\beta)$. Taking into account the choice of the word $\beta$, we have

$$\Psi(\pi(\xi_0)) = M_\Psi(z, \bar{\sigma}) . \tag{2.3}$$

Let $\pi(\xi_0) = (f_{j_1}, \sigma_{j_1}) \ldots (f_{j_m}, \sigma_{j_m})$. Denote $\alpha_0 = \lambda$, and for $i = 1, \ldots, m$,
denote $\alpha_i = (f_{j_1}, \sigma_{j_1}) \ldots (f_{j_i}, \sigma_{j_i})$. For $i = 1, \ldots, m$, denote by $\delta_{j_i}$ the minimum
number from $E_k \setminus \{\sigma_{j_i}\}$ such that

$$N(T\alpha_{i-1}(f_{j_i}, \delta_{j_i}), P) = \max\{N(T\alpha_{i-1}(f_{j_i}, \sigma), P) : \sigma \in E_k \setminus \{\sigma_{j_i}\}\} .$$

Let $\xi$ be an arbitrary complete path in the decision tree $X_\Psi(z, P, T)$. Let
$\xi = \xi_0$. By applying (2.3), we obtain $\Psi(\pi(\xi_0)) = M_\Psi(z, \bar{\sigma}) \le M_\Psi(z)$. Let $\xi \ne
\xi_0$. One can see that there exist numbers $r \in \{1, \ldots, m\}$ and $\delta \in E_k$ such that
$\pi(\xi) = \alpha_{r-1}(f_{j_r}, \delta)$. Therefore, $\Psi(\pi(\xi)) \le \Psi(\pi(\xi_0))$ and $\Psi(\pi(\xi)) \le M_\Psi(z)$.
Part (a) of the lemma is proved.

Let $\xi$ be an arbitrary complete path in the decision tree $X_\Psi(z, P, T)$, for
which $T\pi(\xi)$ is a nonterminal subtable. The fact that the subtable $T\pi(\xi_0)$
is terminal implies $\xi \ne \xi_0$. It is easy to see that there exist numbers $r \in
\{1, \ldots, m\}$ and $\delta \in E_k$ such that $\delta \ne \sigma_{j_r}$ and $\pi(\xi) = \alpha_{r-1}(f_{j_r}, \delta)$.

Let us show that $N(T\pi(\xi), P) \le N(T, P)/2$. Evidently,

$$N(T\pi(\xi), P) \le N(T(f_{j_r}, \delta), P) .$$

Taking into account the choice of the number $\sigma_{j_r}$, we obtain that
$N(T(f_{j_r}, \delta), P) \le N(T(f_{j_r}, \sigma_{j_r}), P)$. Since $\delta \ne \sigma_{j_r}$, the relation

$$N(T(f_{j_r}, \delta), P) + N(T(f_{j_r}, \sigma_{j_r}), P) \le N(T, P)$$

holds. Consequently, $N(T\pi(\xi), P) \le N(T, P)/2$.

Obviously, $h(\pi(\xi)) = r$. Let $r \ge 2$. Let us show that $N(T\pi(\xi), P) \le
N(T, P)/r$. Since $\delta_{j_{i+1}} \ne \sigma_{j_{i+1}}$, the inequalities

$$N(T\alpha_{i+1}, P) + N(T\alpha_i(f_{j_{i+1}}, \delta_{j_{i+1}}), P) \le N(T\alpha_i, P)$$

hold for $i = 0, \ldots, r - 2$. Summing these inequalities over i from 0 to $r - 2$, we
obtain

$$N(T\alpha_{r-1}, P) + \sum_{i=0}^{r-2} N(T\alpha_i(f_{j_{i+1}}, \delta_{j_{i+1}}), P) \le N(T, P) . \tag{2.4}$$

Let us show that for any $i \in \{0, \ldots, r - 2\}$,

$$N(T\pi(\xi), P) \le N(T\alpha_i(f_{j_{i+1}}, \delta_{j_{i+1}}), P) . \tag{2.5}$$

The inequality

$$N(T\alpha_i(f_{j_r}, \delta), P) \le N(T\alpha_i(f_{j_{i+1}}, \delta_{j_{i+1}}), P)$$

follows from the choice of the attribute $f_{j_{i+1}}$ (see the description of the subprocess
$X_\Psi$) and the definition of the number $\delta_{j_{i+1}}$. The inequality

$$N(T\pi(\xi), P) \le N(T\alpha_i(f_{j_r}, \delta), P)$$

is obvious. These two inequalities imply (2.5). The inequality $N(T\pi(\xi), P) \le
N(T\alpha_{r-1}, P)$ is obvious. This inequality, (2.4) and (2.5) imply $rN(T\pi(\xi), P)
\le N(T, P)$. Since $r \ge 2$, the relation $N(T\pi(\xi), P) \le N(T, P)/r$ holds. Part
(b) of the lemma is proved. □

Using the description of the process $Y_{U,\Psi}$ and the subprocess $X_\Psi$, and Lemma
2.1, it is not hard to prove the following proposition.

Proposition 2.2. Let U be an information system, and $\Psi$ a weight function
for U. Then for any problem z over U and any probability distribution P
for z, the process $Y_{U,\Psi}$ ends in a finite number of steps. The resulting tree
$Y_{U,\Psi}(z, P)$ is a decision tree for the problem z that solves z.

Proof of Theorem 2.3. If $T_z$ is a terminal table, then the equality
$h_\Psi(Y_{U,\Psi}(z, P), P) = 0$ follows from the description of the process $Y_{U,\Psi}$. This
equality and Proposition 2.2 imply $h_\Psi(z, P) \le 0$.

Let $T_z$ be a nonterminal table. Consider an arbitrary row $\bar{\delta} \in T_z$ and find
the complete path $\xi^{\bar{\delta}}$ in the decision tree $Y_{U,\Psi}(z, P)$ such that $\bar{\delta} \in T_z\pi(\xi^{\bar{\delta}})$.
From the description of the process $Y_{U,\Psi}$ and the assumption that $T_z$ is a
nonterminal table it follows that $\pi(\xi^{\bar{\delta}}) = \pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_m^{\bar{\delta}})$ for some $m \in \omega \setminus \{0\}$
where $\xi_1^{\bar{\delta}}$ is a complete path in the decision tree $X_\Psi(z, P, T_z)$, and (if $m \ge 2$)
$\xi_i^{\bar{\delta}}$ is a complete path in the decision tree $X_\Psi(z, P, T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_{i-1}^{\bar{\delta}}))$,
$i = 2, \ldots, m$.

By the assumption, the table $T_z$ is nonterminal. If $m \ge 2$, then the
description of the process $Y_{U,\Psi}$ implies that $T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_{i-1}^{\bar{\delta}})$ is a nonterminal table for
$i = 2, \ldots, m$. Applying part (a) of Lemma 2.1, we obtain $\Psi(\pi(\xi_i^{\bar{\delta}})) \le M_\Psi(z)$
for $i = 1, \ldots, m$. Consequently,

$$\Psi(\pi(\xi^{\bar{\delta}})) = \sum_{i=1}^{m} \Psi(\pi(\xi_i^{\bar{\delta}})) \le m M_\Psi(z) . \tag{2.6}$$

Let us show that $m \le -\log_2 P(\bar{\delta}) + \log_2 N(T_z, P) + 1$. Evidently, the
inequality holds for $m = 1$. Let $m \ge 2$. Part (b) of Lemma 2.1 implies

$$N(T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_{m-1}^{\bar{\delta}}), P) \le N(T_z, P)/2^{m-1} .$$

One can see that $\bar{\delta} \in T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_{m-1}^{\bar{\delta}})$. Taking into account this condition,
we obtain

$$N(T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_{m-1}^{\bar{\delta}}), P) \ge P(\bar{\delta}) .$$

Consequently, $2^{m-1} \le N(T_z, P)/P(\bar{\delta})$ and $m \le -\log_2 P(\bar{\delta}) + \log_2 N(T_z, P) +
1$. The obtained inequality and (2.6) result in

$$\Psi(\pi(\xi^{\bar{\delta}})) \le M_\Psi(z)(-\log_2 P(\bar{\delta}) + \log_2 N(T_z, P) + 1) .$$

From the definition of the weighted average depth it follows that

$$h_\Psi(Y_{U,\Psi}(z, P), P) = \frac{1}{N(T_z, P)} \sum_{\bar{\delta} \in T_z} \Psi(\pi(\xi^{\bar{\delta}})) P(\bar{\delta}) \le \frac{M_\Psi(z)}{N(T_z, P)} \sum_{\bar{\delta} \in T_z} (\log_2 N(T_z, P) - \log_2 P(\bar{\delta}) + 1) P(\bar{\delta})$$

$$= M_\Psi(z)(H(P) + 1) .$$
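This inequality and Proposition 2.2 imply $h_\Psi(z, P) \le M_\Psi(z)(H(P) + 1)$. □

Proof of Theorem 2.4. Let $z = (\nu, f_1, \ldots, f_n)$. If $M(z) = 0$, one can see that
the table $T_z$ is terminal and $h(z, P) = 0$.

Let $M(z) = 1$. Assume that for $i = 1, \ldots, n$, there exists a number $\delta_i \in
E_k$ such that $\nu(x) \not\equiv const$ on the set of rows of the subtable $T_z(f_i, \delta_i)$.
Denote $\bar{\delta} = (\delta_1, \ldots, \delta_n)$. One can see that $M(z, \bar{\delta}) \ge 2$, but, according to the
definition, $M(z, \bar{\delta}) \le M(z) = 1$. This contradiction shows that there exists
an attribute $f_i \in \{f_1, \ldots, f_n\}$ such that for any $\delta \in E_k$, either $T_z(f_i, \delta)$ is
empty or $\nu(x) \equiv const$ on the set of rows of this table. It is easy to show
that there exists a decision tree $\Gamma$ for the problem z that solves z for which
$h(\Gamma, P) = 1$. Consequently, $h(z, P) \le 1$.

Let $M(z) \ge 2$. This inequality requires $T_z$ to be a nonterminal table.
Consider an arbitrary row $\bar{\delta} \in T_z$ and find a complete path $\xi^{\bar{\delta}}$ in the decision
tree $Y_{U,h}(z, P)$ such that $\bar{\delta} \in T_z\pi(\xi^{\bar{\delta}})$. From the description of the process
$Y_{U,h}$ and from the fact that $T_z$ is a nonterminal table it follows that
$\pi(\xi^{\bar{\delta}}) = \pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_m^{\bar{\delta}})$ for some $m \in \omega \setminus \{0\}$ where $\xi_1^{\bar{\delta}}$ is a complete path in
the decision tree $X_h(z, P, T_z)$, and (if $m \ge 2$) $\xi_i^{\bar{\delta}}$ is a complete path in the
decision tree $X_h(z, P, T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_{i-1}^{\bar{\delta}}))$, $i = 2, \ldots, m$. Denote $r_j^{\bar{\delta}} = h(\pi(\xi_j^{\bar{\delta}}))$
for $j = 1, \ldots, m$. Let us estimate the value $h(\pi(\xi^{\bar{\delta}})) = \sum_{j=1}^{m} r_j^{\bar{\delta}}$. We will prove
that

$$h(\pi(\xi^{\bar{\delta}})) \le \begin{cases} 2(-\log_2 P(\bar{\delta}) + \log_2 N(T_z, P)) + M(z), & \text{if } 2 \le M(z) \le 3, \\ \dfrac{M(z)}{\log_2 M(z)}(-\log_2 P(\bar{\delta}) + \log_2 N(T_z, P)) + M(z), & \text{if } M(z) \ge 4 . \end{cases} \tag{2.7}$$

Let $m = 1$. Part (a) of Lemma 2.1 implies that $r_1^{\bar{\delta}} \le M(z)$. Therefore, the
inequality (2.7) holds for $m = 1$. Let $m \ge 2$. Denote $y_i^{\bar{\delta}} = \max\{2, r_i^{\bar{\delta}}\}$ for $i =
1, \ldots, m$. By the assumption, $T_z$ is a nonterminal table. From the description
of the process $Y_{U,h}$ it follows that $T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_i^{\bar{\delta}})$ is a nonterminal subtable
for $i = 1, \ldots, m - 1$. Lemma 2.1 and the inequality $m \ge 2$ imply

$$N(T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_{m-1}^{\bar{\delta}}), P) \le N(T_z, P) \Big/ \prod_{i=1}^{m-1} y_i^{\bar{\delta}} .$$

Since $\bar{\delta} \in T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_{m-1}^{\bar{\delta}})$, we obtain $N(T_z\pi(\xi_1^{\bar{\delta}}) \ldots \pi(\xi_{m-1}^{\bar{\delta}}), P) \ge P(\bar{\delta})$.
Consequently,

$$\prod_{i=1}^{m-1} y_i^{\bar{\delta}} \le N(T_z, P)/P(\bar{\delta}) .$$

Taking the binary logarithm of both sides results in

$$\sum_{i=1}^{m-1} \log_2 y_i^{\bar{\delta}} \le -\log_2 P(\bar{\delta}) + \log_2 N(T_z, P) .$$

This inequality implies

$$\sum_{i=1}^{m} r_i^{\bar{\delta}} = r_m^{\bar{\delta}} + \sum_{i=1}^{m-1} (\log_2 y_i^{\bar{\delta}})(r_i^{\bar{\delta}}/\log_2 y_i^{\bar{\delta}})$$

$$\le r_m^{\bar{\delta}} + \Big(\sum_{i=1}^{m-1} \log_2 y_i^{\bar{\delta}}\Big)(\max\{r_i^{\bar{\delta}}/\log_2 y_i^{\bar{\delta}} : i \in \{1, \ldots, m - 1\}\}) \tag{2.8}$$

$$\le r_m^{\bar{\delta}} + (-\log_2 P(\bar{\delta}) + \log_2 N(T_z, P)) \times (\max\{r_i^{\bar{\delta}}/\log_2 y_i^{\bar{\delta}} : i \in \{1, \ldots, m - 1\}\}) .$$

Consider the function $q(x) = x/\log_2(\max\{2, x\})$, $x \in \omega$. One can see
that $q(0) = 0$, $q(1) = 1$, $q(2) = 2$, $q(3) < 2$, $q(4) = 2$, and the function $q(x)$
is monotonically increasing for $x \ge 3$. Therefore, for any $n \in \omega \setminus \{0\}$, the
following condition holds:

$$\max\{q(i) : i \in \{0, \ldots, n\}\} = \begin{cases} 1, & \text{if } n = 1, \\ 2, & \text{if } 2 \le n \le 3, \\ \dfrac{n}{\log_2 n}, & \text{if } n \ge 4 . \end{cases} \tag{2.9}$$

From part (a) of Lemma 2.1 it follows that the inequality

$$r_i^{\bar{\delta}} \le M(z) \tag{2.10}$$

holds for $i = 1, \ldots, m$. From (2.9), (2.10) and the inequality $M(z) \ge 2$ we
have

$$\max\{r_i^{\bar{\delta}}/\log_2 y_i^{\bar{\delta}} : i \in \{1, \ldots, m - 1\}\} \le \max\{q(i) : i \in \{0, \ldots, M(z)\}\} = \begin{cases} 2, & \text{if } 2 \le M(z) \le 3, \\ \dfrac{M(z)}{\log_2 M(z)}, & \text{if } M(z) \ge 4 . \end{cases}$$

These inequalities and inequalities (2.8) and (2.10) imply (2.7). Then

$$h(Y_{U,h}(z, P), P) = \frac{1}{N(T_z, P)} \sum_{\bar{\delta} \in T_z} h(\pi(\xi^{\bar{\delta}})) P(\bar{\delta}) \le \begin{cases} M(z) + 2H(P), & \text{if } 2 \le M(z) \le 3, \\ M(z) + \dfrac{M(z)}{\log_2 M(z)} H(P), & \text{if } M(z) \ge 4 . \end{cases}$$

Proposition 2.2 results in correctness of the theorem for $M(z) \ge 2$. □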
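
The case analysis for the function q used in the proof above can be checked numerically for small arguments. The sketch below compares the maximum of q over {0, ..., n} with the piecewise expression from the proof; it is a sanity check on a finite range, not a replacement for the monotonicity argument.

```python
from math import isclose, log2


def q(x):
    """q(x) = x / log2(max{2, x}) for a nonnegative integer x."""
    return x / log2(max(2, x))


def max_q(n):
    """Maximum of q over the integers 0, ..., n."""
    return max(q(i) for i in range(n + 1))


def piecewise(n):
    """Closed form claimed in the proof for n >= 1."""
    if n == 1:
        return 1
    if n <= 3:
        return 2
    return n / log2(n)
```

The two functions agree on every n that was tried, including the boundary values n = 3 and n = 4 where the cases switch.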

2.4 On Possibility of Problem Decomposition 

In this section, the possibility of reduction is considered for a problem over
a 2-valued information system. Under certain conditions an optimal (relative
to the average depth) decision tree can be constructed as a composition of
optimal decision trees for simpler problems that form a so-called proper
decomposition of the original problem.

2.4.1 Proper Problem Decomposition

Let $U = (A, F)$ be a 2-valued information system, $z_0 = (\nu_0, f_1^0, \ldots, f_{n_0}^0)$ a
diagnostic problem over U with m classes of equivalence $A_1, \ldots, A_m$. Let
$T_{z_0} = \{\bar{\delta}_1^0, \ldots, \bar{\delta}_m^0\}$ where $\bar{\delta}_i^0 = (f_1^0(a_i), \ldots, f_{n_0}^0(a_i))$, $a_i \in A_i$, $i = 1, \ldots, m$.
For $i = 1, \ldots, m$, let $z_i = (\nu_i, f_1^i, \ldots, f_{n_i}^i)$ be a problem over the
information system $(A_i, F)$ with $s_i$ classes of equivalence, and let the table $T_{z_i}$ contain
$s_i$ rows $\bar{\delta}_1^i, \ldots, \bar{\delta}_{s_i}^i$. For $i = 1, \ldots, m$, let $P_i$ be an arbitrary probability
distribution for the problem $z_i$, and $P_0 = (N(T_{z_1}, P_1), \ldots, N(T_{z_m}, P_m))$ be a
probability distribution for the problem $z_0$.

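For $i = 1, \ldots, m$ and $j = 1, \ldots, s_i$, denote $\bar{\sigma}_j^i = (\bar{\sigma}_0^{ij} \bar{\sigma}_1^{ij} \ldots \bar{\sigma}_m^{ij})$ where
$\bar{\sigma}_k^{ij} \in E_2^{n_k}$, $k = 0, \ldots, m$, and

$$\bar{\sigma}_k^{ij} = \begin{cases} \bar{\delta}_i^0, & \text{if } k = 0, \\ \bar{\delta}_j^i, & \text{if } k = i, \\ (0, \ldots, 0), & \text{if } k \in \{1, \ldots, m\} \setminus \{i\} . \end{cases}$$

Define a function $\nu : E_2^{n_0 + n_1 + \cdots + n_m} \to \omega$ as follows:

$$\nu(\bar{\delta}) = \begin{cases} \nu_i(\bar{\delta}_j^i), & \text{if } \bar{\delta} = \bar{\sigma}_j^i \text{ for some } i \in \{1, \ldots, m\} \text{ and } j \in \{1, \ldots, s_i\}, \\ 0, & \text{otherwise} . \end{cases}$$

Consider a problem $z = (\nu, f_1^0, \ldots, f_{n_0}^0, \tilde{f}_1^1, \ldots, \tilde{f}_{n_1}^1, \ldots, \tilde{f}_1^m, \ldots, \tilde{f}_{n_m}^m)$ over U,
where

$$\tilde{f}_j^i(a) = \begin{cases} f_j^i(a), & \text{if } a \in A_i, \\ 0, & \text{if } a \notin A_i \end{cases}$$

for $j = 1, \ldots, n_i$, $i = 1, \ldots, m$ and $a \in A$. One can see that the
table $T_z$ contains the rows $\bar{\sigma}_1^1, \ldots, \bar{\sigma}_{s_1}^1, \ldots, \bar{\sigma}_1^m, \ldots, \bar{\sigma}_{s_m}^m$ and does not
contain any other rows. Define a probability distribution P for the problem
z as follows: $P(\bar{\sigma}_j^i) = P_i(\bar{\delta}_j^i)$ for $j = 1, \ldots, s_i$ and $i = 1, \ldots, m$. The set
$((z_0, P_0), (z_1, P_1), \ldots, (z_m, P_m))$ is called a proper decomposition of the pair
(z, P) if: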



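i) for $j = 1, \ldots, n_i$ and $i = 1, \ldots, m$, the inequality

$$N(T_{z_i}(f^i_j, 1), P_i) \le \frac{1}{2} \min_{l \in \{1, \ldots, m\}} N(T_{z_l}, P_l)$$

holds;

ii) for any $i, j \in \{1, \ldots, m\}$, $i \ne j$, and $c \in \omega$ such that

$$q_i = \sum_{\bar\delta \in T_{z_i},\, \nu_i(\bar\delta) = c} P_i(\bar\delta) / N(T_{z_i}, P_i) > 0$$

and

$$q_j = \sum_{\bar\delta \in T_{z_j},\, \nu_j(\bar\delta) = c} P_j(\bar\delta) / N(T_{z_j}, P_j) > 0,$$

the inequalities $\min(q_i, q_j) < 1/2$ and $\max(q_i, q_j) < 1$ hold.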

Let $((z_0, P_0), (z_1, P_1), \ldots, (z_m, P_m))$ be a proper decomposition of the pair $(z, P)$, and let $\Gamma_i$ be a decision tree for the problem $z_i$ that solves $z_i$, $i = 0, \ldots, m$. For $i = 1, \ldots, m$, apply the following transformation to the tree $\Gamma_i$: for each nonterminal node $w$, replace the attribute $f^i_j$ that is assigned to $w$ with the corresponding attribute $\tilde f^i_j$. Denote the resulting tree by $\tilde\Gamma_i$.

For $i = 1, \ldots, m$, find a complete path $\xi_i$ in $\Gamma_0$ such that $\bar\delta^i \in T_{z_0}\pi(\xi_i)$ and replace the terminal node of the path $\xi_i$ with the tree $\tilde\Gamma_i$. Denote the resulting tree by $\Phi(\Gamma_0, \Gamma_1, \ldots, \Gamma_m)$.
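
The composition described above can be illustrated with a minimal Python sketch. The tuple encoding of trees, the helper names `compose`, `evaluate` and `average_depth`, and the toy attributes `b`, `x1`, `x2` are assumptions made for this illustration and do not come from the text. The leaves of the tree for the basic problem are assumed to be labeled with the index of the equivalence class they isolate, and the subtrees are assumed to use the extended attributes already.

```python
from fractions import Fraction

# A tree is ("leaf", value) or (attribute, subtree_for_0, subtree_for_1).

def compose(tree0, subtrees):
    """Replace every leaf of tree0 labeled i with subtrees[i]."""
    if tree0[0] == "leaf":
        return subtrees[tree0[1]]
    attr, zero, one = tree0
    return (attr, compose(zero, subtrees), compose(one, subtrees))

def evaluate(tree, row):
    """Follow row (dict attribute -> 0/1); return (decision, depth)."""
    depth = 0
    while tree[0] != "leaf":
        attr, zero, one = tree
        tree = one if row[attr] == 1 else zero
        depth += 1
    return tree[1], depth

def average_depth(tree, rows, weights):
    """Weighted average number of attribute checks over the rows."""
    total = sum(w * evaluate(tree, r)[1] for r, w in zip(rows, weights))
    return Fraction(total, sum(weights))

# Toy problem: basic attribute b separates class 1 (b = 0) from class 2
# (b = 1); x1 and x2 are extended attributes, each 0 outside its own class.
gamma0 = ("b", ("leaf", 1), ("leaf", 2))
gamma1 = ("x1", ("leaf", "p"), ("leaf", "q"))
gamma2 = ("x2", ("leaf", "r"), ("leaf", "s"))
phi = compose(gamma0, {1: gamma1, 2: gamma2})

rows = [
    {"b": 0, "x1": 0, "x2": 0},
    {"b": 0, "x1": 1, "x2": 0},
    {"b": 1, "x1": 0, "x2": 0},
    {"b": 1, "x1": 0, "x2": 1},
]
decisions = [evaluate(phi, r)[0] for r in rows]
depth = average_depth(phi, rows, [1, 1, 1, 1])
```

In this toy case every row passes one basic check and one extended check, so the average depth of the composed tree is the sum of the average depths of its parts.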



2.4.2 Theorem of Decomposition

Theorem 2.5. Let $z$ be a problem over a 2-valued information system $U$, $P$ a probability distribution for $z$, and $((z_0, P_0), (z_1, P_1), \ldots, (z_m, P_m))$ a proper decomposition of the pair $(z, P)$. Let $\Gamma_i$ be a decision tree for $z_i$ that solves $z_i$ and is optimal for $z_i$ and $P_i$, $i = 0, \ldots, m$. Then the tree $\Phi(\Gamma_0, \Gamma_1, \ldots, \Gamma_m)$ is a decision tree for the problem $z$ that solves $z$ and is optimal for $z$ and $P$.

We preface the proof of the theorem by a series of lemmas. Let us define some auxiliary notions.

Let $\Gamma$ be a decision tree for the problem $z$. Denote by $V(\Gamma)$ and $E(\Gamma)$ the set of nodes and the set of edges of $\Gamma$ respectively. For an arbitrary node $v \in V(\Gamma)$, denote by $path(\Gamma, v)$ the path from the root of $\Gamma$ to the node $v$. For an arbitrary nonterminal node $v \in V(\Gamma)$ and an arbitrary number $\delta \in \{0, 1\}$, denote by $e(\Gamma, v, \delta)$ the edge that leaves $v$ and is labeled with $\delta$. Let $v$ be a nonterminal node in the tree $\Gamma$ and $f$ the attribute assigned to $v$. The node $v$ is called essential if the table $T_z\pi(path(\Gamma, v))(f, \delta)$ is nonempty for $\delta = 0, 1$. The decision tree is called reduced if all its nonterminal nodes are essential.
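
The definition of a reduced tree can be checked mechanically. The sketch below is an illustration only: the tuple encoding of trees, the function name `is_reduced` and the sample table are assumptions, not notation from the text. A nonterminal node is treated as essential when both values of its attribute occur among the rows that reach it.

```python
# A tree is ("leaf", value) or (attribute, subtree_for_0, subtree_for_1).

def is_reduced(tree, rows):
    """Return True if every nonterminal node of tree is essential.

    rows is the list of table rows (dict attribute -> 0/1) that reach
    the root of tree.
    """
    if tree[0] == "leaf":
        return True
    attr, zero, one = tree
    rows0 = [r for r in rows if r[attr] == 0]
    rows1 = [r for r in rows if r[attr] == 1]
    if not rows0 or not rows1:
        return False  # one of the two subtables is empty: inessential node
    return is_reduced(zero, rows0) and is_reduced(one, rows1)

table = [
    {"a": 0, "b": 0},
    {"a": 0, "b": 1},
    {"a": 1, "b": 1},
]
reduced_tree = ("a", ("b", ("leaf", 0), ("leaf", 1)), ("leaf", 2))
# The check of b after a = 1 is useless, because b = 1 on every such row.
wasteful_tree = ("a", ("b", ("leaf", 0), ("leaf", 1)),
                      ("b", ("leaf", 2), ("leaf", 2)))
```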

Let $z$ be a problem over an information system, $P$ a probability distribution for $z$, and $((z_0, P_0), (z_1, P_1), \ldots, (z_m, P_m))$ a proper decomposition of the pair $(z, P)$. An attribute from the description of the problem $z$ is called basic if it is contained in the description of $z_0$, and extended otherwise.

Let $\Gamma$ be a reduced decision tree for the problem $z$, and $\xi = v_1, e_1, \ldots, v_t, e_t, v_{t+1}$ a complete path in $\Gamma$ where $v_1, \ldots, v_{t+1} \in V(\Gamma)$ and $e_1, \ldots, e_t \in E(\Gamma)$, $t \ge 1$. Suppose that for some $k \in \{1, \ldots, t + 1\}$, the node $v_i$ is assigned with a basic attribute if and only if $i < k$. Then the path $\xi$ is called ordered by basic attributes.

Let the path $\xi$ be not ordered by basic attributes. For $i = 1, \ldots, t$, denote by $f_i$ the attribute assigned to the node $v_i$. Then there exist natural $j$ and $k$, $j < k \le t$, such that $f_j, \ldots, f_{k-1}$ are extended attributes, $f_k$ is a basic attribute, and (if $j > 1$) $f_1, \ldots, f_{j-1}$ are basic attributes. Denote

$$N_0(\xi) = N(T_z\pi(path(\Gamma, v_k))(f_k, 0), P)$$

and

$$N_1(\xi) = N(T_z\pi(path(\Gamma, v_k))(f_k, 1), P).$$

Since $v_k$ is assigned with a basic attribute and $\Gamma$ is a reduced tree, for $i = j, \ldots, k - 1$ the edge $e_i$ is assigned with the number 0. For $i = j, \ldots, k - 1$, denote by $w_i$ the node that the edge $e(\Gamma, v_i, 1)$ enters, and denote by $\sigma_i$ the number such that $T_z\pi(path(\Gamma, w_i)) = T_z\pi(path(\Gamma, w_i))(f_k, \sigma_i)$. The path $\xi$ is called reducible if $k \ge j + 3$ and the set $\{\sigma_j, \ldots, \sigma_{k-1}\}$ contains both 0 and 1.

Define a path reduction operation. Let the operation be applied to the path $\xi$.

Step 1. For $\delta = 0, 1$, denote by $\Gamma(\delta)$ the subtree that the edge $e(\Gamma, v_k, \delta)$ enters, and denote $e(\delta) = e(\Gamma, v_k, \delta)$. If $v_j$ is not the root of $\Gamma$, then reroute the edge $e_{j-1}$ so that it enters the node $v_k$. Proceed to step 2.

Let $i$ steps have already been done for some $1 \le i \le k - j$.

Step $(i + 1)$. Reroute the edge $e(\sigma_{j+i-1})$ so that it enters the node $v_{j+i-1}$. Denote $e(\sigma_{j+i-1}) = e(\Gamma, v_{j+i-1}, 0)$. Proceed to step $(i + 2)$.

Step $(k - j + 2)$. For $\delta = 0, 1$, reroute the edge $e(\delta)$ so that it enters the subtree $\Gamma(\delta)$. The transformation is completed.
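
The effect of the path reduction operation on the chain of extended nodes can be sketched in Python as follows. This is an illustration under stated assumptions rather than the construction of the text: trees are encoded as nested tuples, the chain is passed as a list of hypothetical triples (extended attribute, subtree entered by its edge labeled 1, the value fixed for the basic attribute on that subtree), and the names `original_chain`, `reduce_chain` and `decide` are invented here.

```python
# A tree is ("leaf", value) or (attribute, subtree_for_0, subtree_for_1).

def original_chain(chain, basic_attr, gamma0, gamma1):
    """Tree before the operation: extended nodes first, basic node last."""
    tree = (basic_attr, gamma0, gamma1)
    for ext_attr, w_subtree, _sigma in reversed(chain):
        tree = (ext_attr, tree, w_subtree)
    return tree

def reduce_chain(chain, basic_attr, gamma0, gamma1):
    """Tree after the operation: the basic node is moved to the top and
    each extended node is hung under the branch matching its sigma."""
    branches = {0: gamma0, 1: gamma1}
    # Walk bottom-up so that earlier chain nodes end up closer to the root.
    for ext_attr, w_subtree, sigma in reversed(chain):
        branches[sigma] = (ext_attr, branches[sigma], w_subtree)
    return (basic_attr, branches[0], branches[1])

def decide(tree, row):
    """Return the decision attached to the leaf reached by row."""
    while tree[0] != "leaf":
        attr, zero, one = tree
        tree = one if row[attr] == 1 else zero
    return tree[1]

chain = [("x1", ("leaf", "a"), 0),
         ("x2", ("leaf", "b"), 1),
         ("x3", ("leaf", "c"), 0)]
before = original_chain(chain, "f", ("leaf", "p"), ("leaf", "q"))
after = reduce_chain(chain, "f", ("leaf", "p"), ("leaf", "q"))
```

On every row that respects the assumption (an extended attribute equal to 1 forces the basic attribute to the corresponding value), both trees reach the same decision, which mirrors statement (a) of the lemma that follows.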

Lemma 2.2. Let $z$ be a problem over an information system, $P$ a probability distribution for $z$ and $D$ a proper decomposition of the pair $(z, P)$. Let $\Gamma$ be a reduced decision tree for $z$ that solves $z$, $\xi$ a complete path in $\Gamma$ that is not ordered by basic attributes, and $\tilde\Gamma$ the tree resulting from applying the path reduction operation to $\xi$. Then

a) $\tilde\Gamma$ is a decision tree for $z$ that solves $z$;

b) $h(\tilde\Gamma, P) \le h(\Gamma, P)$;

c) $h(\tilde\Gamma, P) \le h(\Gamma, P) - (N_0(\xi) + N_1(\xi)) / N(T_z, P)$ if $\xi$ is a reducible path.

Proof. One can see from the description of the path reduction operation that $\tilde\Gamma$ is a decision tree for the problem $z$. Let us show that $\tilde\Gamma$ solves $z$. Let $\xi = v_1, e_1, \ldots, v_t, e_t, v_{t+1}$ where $v_1, \ldots, v_{t+1} \in V(\Gamma)$, $e_1, \ldots, e_t \in E(\Gamma)$, $t \ge 1$, and for $i = 1, \ldots, t$, the node $v_i$ is assigned with an attribute $f_i$. Then there exist natural $j$ and $k$, $j < k \le t$, such that $f_j, \ldots, f_{k-1}$ are extended attributes, $f_k$ is a basic attribute, and (if $j > 1$) $f_1, \ldots, f_{j-1}$ are basic attributes. Let $\bar\delta \in T_z$ be an arbitrary row, and $\varphi$ a complete path in $\Gamma$ such that $\bar\delta \in T_z\pi(\varphi)$. Denote by $\tilde\varphi$ the complete path in $\tilde\Gamma$ that ends in the same terminal node as $\varphi$. Let us show that $\bar\delta \in T_z\pi(\tilde\varphi)$. If $v_j \notin \varphi$, then $\tilde\varphi = \varphi$. If $v_k \in \varphi$, then the path $\tilde\varphi$ is obtained from $\varphi$ by deleting several pairs consisting of a node and one of its outgoing edges. Therefore $T_z\pi(\tilde\varphi) \supseteq T_z\pi(\varphi)$. Let $v_i \in \varphi$ and $v_{i+1} \notin \varphi$ for some $i \in \{j, \ldots, k - 1\}$. Since $v_{i+1} \notin \varphi$, the edge that leaves the node $v_i$ and is contained in the path $\varphi$ is labeled with 1. The fact that $f_i$ is an extended attribute implies that the set of solutions of the equation $f_i(x) = 1$ is contained in one of the equivalence classes of the problem $z_0$. Then there exists a number $\delta \in \{0, 1\}$ such that $T_z\pi(\varphi) = T_z\pi(\varphi)(f_k, \delta)$. From the description of the path reduction operation it follows that the path $\tilde\varphi$ is obtained from $\varphi$ by deleting several pairs consisting of a node and one of its outgoing edges and by adding the pair $(v_k, e(\Gamma, v_k, \delta))$. Therefore, $T_z\pi(\tilde\varphi) \supseteq T_z\pi(\varphi)(f_k, \delta)$, and, taking into account the last relation, $T_z\pi(\tilde\varphi) \supseteq T_z\pi(\varphi)$. Thus, in the general case, $T_z\pi(\tilde\varphi) \supseteq T_z\pi(\varphi)$ and $\bar\delta \in T_z\pi(\tilde\varphi)$. Then the fact that $\Gamma$ solves the problem $z$ implies that $\tilde\Gamma$ also solves $z$.
Let us prove parts (b) and (c) of the lemma. Since $\Gamma$ is a reduced decision tree, $v_k$ is an essential node. Then for $\delta = 0, 1$, the table $T_{z_0}\pi(path(\Gamma, v_j))$ contains at least one row in which the attribute $f_k$ takes the value $\delta$. Denote this row by $\bar\delta^{i_\delta}$ and denote $P_\delta = N(T_{z_{i_\delta}}, P_{i_\delta})$. For $i = j, \ldots, k - 1$, denote by $w_i$ the node which the edge $e(\Gamma, v_i, 1)$ enters, denote by $\sigma_i$ the number from the set $\{0, 1\}$ such that $T_z\pi(path(\Gamma, w_i)) = T_z\pi(path(\Gamma, w_i))(f_k, \sigma_i)$, and denote

$$N_i = N(T_z\pi(path(\Gamma, v_i))(f_i, 1), P).$$

Then for $\delta = 0, 1$, the following relation holds:

$$\sum_{i=j,\ \sigma_i=\delta}^{k-1} N_i + N_\delta(\xi) \ge P_\delta. \tag{2.11}$$

Consider several cases.

1) Let $\sigma_j = \ldots = \sigma_{k-1}$. One can see that $\xi$ is not a reducible path. Then

$$h(\tilde\Gamma, P) = h(\Gamma, P) + \frac{1}{N(T_z, P)} \left( \sum_{i=j}^{k-1} N_i - (k - j) N_{1-\sigma_j}(\xi) \right). \tag{2.12}$$

The relation (i) from the definition of the proper problem decomposition implies $N_i \le P_{1-\sigma_j}$ for $i = j, \ldots, k - 1$. Summing these inequalities, we obtain

$$\sum_{i=j}^{k-1} N_i \le (k - j) P_{1-\sigma_j}. \tag{2.13}$$

From (2.11) it follows that

$$N_{1-\sigma_j}(\xi) \ge P_{1-\sigma_j}. \tag{2.14}$$

The relations (2.12), (2.13), (2.14) imply (b).

2) Let $k = j + 2$ and $\sigma_j \ne \sigma_{j+1}$. One can see that $\xi$ is not a reducible path. Then

$$h(\tilde\Gamma, P) = h(\Gamma, P) + \frac{N_j - N_0(\xi) - N_1(\xi)}{N(T_z, P)}. \tag{2.15}$$

The relation (i) implies

$$N_j + N_{j+1} \le P_{\sigma_{j+1}}. \tag{2.16}$$

From (2.11) it follows that

$$N_{j+1} + N_{\sigma_{j+1}}(\xi) \ge P_{\sigma_{j+1}}. \tag{2.17}$$

The relations (2.15), (2.16), (2.17) imply (b).

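3) Let $k \ge j + 3$, $\sigma_j \ne \sigma_{j+1}$ and $\sigma_j = \sigma_{j+2}$. One can see that $\xi$ is a reducible path. Then

$$h(\tilde\Gamma, P) \le h(\Gamma, P) + \frac{1}{N(T_z, P)} \left( N_j - \sum_{i=j+2,\ \sigma_i=\sigma_{j+1}}^{k-1} N_i - N_{\sigma_j}(\xi) - 2 N_{\sigma_{j+1}}(\xi) \right). \tag{2.18}$$

The relation (i) implies

$$N_j + N_{j+1} \le P_{\sigma_{j+1}}. \tag{2.19}$$

From (2.11) it follows that

$$\sum_{i=j+1,\ \sigma_i=\sigma_{j+1}}^{k-1} N_i + N_{\sigma_{j+1}}(\xi) \ge P_{\sigma_{j+1}}. \tag{2.20}$$

The relations (2.18), (2.19), (2.20) imply (c).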

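4) Let $k \ge j + 3$, $\sigma_j \ne \sigma_{j+1}$ and $\sigma_j \ne \sigma_{j+2}$. One can see that $\xi$ is a reducible path. Then

$$h(\tilde\Gamma, P) \le h(\Gamma, P) + \frac{1}{N(T_z, P)} \left( N_j - \sum_{i=j+2,\ \sigma_i=\sigma_j}^{k-1} N_i - 2 N_{\sigma_j}(\xi) - N_{\sigma_{j+1}}(\xi) \right). \tag{2.21}$$

The relation (i) implies that

$$N_j \le \frac{1}{2} P_{\sigma_j}. \tag{2.22}$$

From (2.11) it follows that

$$N_j + \sum_{i=j+2,\ \sigma_i=\sigma_j}^{k-1} N_i + N_{\sigma_j}(\xi) \ge P_{\sigma_j}. \tag{2.23}$$

The relations (2.21), (2.22), (2.23) imply (c).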

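5) Let $k \ge j + 3$, $\sigma_j = \sigma_{j+1} = \ldots = \sigma_m$, and $\sigma_m \ne \sigma_{m+1}$ for some $m \in \{j + 1, \ldots, k - 2\}$. One can see that $\xi$ is a reducible path. Then

$$h(\tilde\Gamma, P) \le h(\Gamma, P) + \frac{\sum_{i=j}^{m} N_i}{N(T_z, P)} - \frac{(m - j)}{N(T_z, P)} \left( \sum_{i=m+1,\ \sigma_i=\sigma_{m+1}}^{k-1} N_i + N_{1-\sigma_j}(\xi) \right) - \frac{N_0(\xi) + N_1(\xi)}{N(T_z, P)}. \tag{2.24}$$

The relation (i) implies $N_i \le P_{1-\sigma_j}/2$ for $i = j, \ldots, m$. Summing these inequalities, we obtain

$$\sum_{i=j}^{m} N_i \le \frac{m - j + 1}{2} P_{1-\sigma_j}. \tag{2.25}$$

From (2.11) it follows that

$$\sum_{i=m+1,\ \sigma_i=\sigma_{m+1}}^{k-1} N_i + N_{1-\sigma_j}(\xi) \ge P_{1-\sigma_j}. \tag{2.26}$$

The relations (2.24), (2.25), (2.26) imply (c).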

One can see that the cases 1-5 cover all possible combinations of $j, k, \sigma_j, \ldots, \sigma_{k-1}$. Statement (b) of the lemma holds in each case and statement (c) holds in the cases in which $\xi$ is a reducible path, so the lemma is proved. □

Lemma 2.3. Let $U$ be a 2-valued information system, $\Psi$ a weight function for $U$, $z$ a problem over $U$ and $P$ a probability distribution for $z$. Let $\Gamma$ be a decision tree for the problem $z$ that solves $z$ and is optimal for $\Psi$, $z$ and $P$. Then $\Gamma$ is a reduced decision tree.

Proof. Let $U = (A, F)$. Suppose that $\Gamma$ is not a reduced decision tree. Let $v$ be an inessential node in $\Gamma$ such that the path $\varphi$ from the root to the node $v$ does not contain any other inessential nodes. One can see that $T_z\pi(\varphi) \ne \emptyset$. Denote by $f$ the attribute assigned to $v$. Then there exists a number $\sigma \in \{0, 1\}$ such that $T_z\pi(\varphi) = T_z\pi(\varphi)(f, \sigma)$. For $\delta = 0, 1$, denote by $\Gamma_\delta$ the subtree whose root the edge $e(\Gamma, v, \delta)$ enters. If $v$ is not the root of $\Gamma$, then denote by $r$ the edge that enters $v$ and transform $\Gamma$ so that the edge $r$ enters the root of the subtree $\Gamma_\sigma$. Delete from $\Gamma$ the node $v$, the edges $e(\Gamma, v, 0)$, $e(\Gamma, v, 1)$ and the subtree $\Gamma_{1-\sigma}$. Denote by $\tilde\Gamma$ the resulting tree. One can see that $\tilde\Gamma$ is a decision tree for $z$. Let us prove that $\tilde\Gamma$ solves the problem $z$.

For an arbitrary row $\bar\delta \in T_z$, denote by $\xi^{\bar\delta}$ the complete path in $\Gamma$ such that $\bar\delta \in T_z\pi(\xi^{\bar\delta})$. Since $T_z\pi(\varphi) = T_z\pi(\varphi)(f, \sigma)$, the terminal node of the path $\xi^{\bar\delta}$ is not contained in $\Gamma_{1-\sigma}$ and was not removed by the transformation. Denote by $\tilde\xi^{\bar\delta}$ the complete path in $\tilde\Gamma$ that ends in the same terminal node as $\xi^{\bar\delta}$. If $v \notin \xi^{\bar\delta}$, then the paths $\xi^{\bar\delta}$ and $\tilde\xi^{\bar\delta}$ coincide, so $T_z\pi(\tilde\xi^{\bar\delta}) = T_z\pi(\xi^{\bar\delta})$ and $\Psi(\tilde\xi^{\bar\delta}) = \Psi(\xi^{\bar\delta})$. If $v \in \xi^{\bar\delta}$, then the path $\tilde\xi^{\bar\delta}$ results from $\xi^{\bar\delta}$ by removing the node $v$ and the edge $e(\Gamma, v, \sigma)$. Then $\Psi(\tilde\xi^{\bar\delta}) = \Psi(\xi^{\bar\delta}) - \Psi(f)$. Since $T_z\pi(\varphi) = T_z\pi(\varphi)(f, \sigma)$, the relation $T_z\pi(\tilde\xi^{\bar\delta}) = T_z\pi(\xi^{\bar\delta})$ holds. Then the fact that $\Gamma$ solves $z$ implies that $\tilde\Gamma$ solves $z$. The relation $T_z\pi(\varphi) \ne \emptyset$ implies $v \in \xi^{\bar\delta}$ for some $\bar\delta \in T_z$. Then $h_\Psi(\tilde\Gamma, P) \le h_\Psi(\Gamma, P) - \Psi(f)P(\bar\delta)/N(T_z, P)$ and $h_\Psi(\tilde\Gamma, P) < h_\Psi(\Gamma, P)$. The latter inequality contradicts the optimality of the tree $\Gamma$, and this contradiction proves the lemma. □
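
The transformation used in this proof (bypass an inessential node and drop the subtree no row can reach) can be sketched as follows. The tuple encoding of trees, the names `prune` and `total_depth`, and the sample table are assumptions made for this illustration; the sketch also assumes that at least one row reaches the root.

```python
# A tree is ("leaf", value) or (attribute, subtree_for_0, subtree_for_1).

def prune(tree, rows):
    """Remove inessential nodes, given the nonempty list of rows
    (dict attribute -> 0/1) that reach the root of tree."""
    if tree[0] == "leaf":
        return tree
    attr, zero, one = tree
    rows0 = [r for r in rows if r[attr] == 0]
    rows1 = [r for r in rows if r[attr] == 1]
    if not rows1:            # every row takes the edge labeled 0
        return prune(zero, rows0)
    if not rows0:            # every row takes the edge labeled 1
        return prune(one, rows1)
    return (attr, prune(zero, rows0), prune(one, rows1))

def total_depth(tree, rows):
    """Sum over rows of the number of attribute checks made by tree."""
    total = 0
    for row in rows:
        node = tree
        while node[0] != "leaf":
            attr, zero, one = node
            node = one if row[attr] == 1 else zero
            total += 1
    return total

table = [
    {"a": 0, "b": 0},
    {"a": 0, "b": 1},
    {"a": 1, "b": 1},
]
wasteful = ("a", ("b", ("leaf", 0), ("leaf", 1)),
                 ("b", ("leaf", 9), ("leaf", 2)))
pruned = prune(wasteful, table)
```

With unit weights and a uniform distribution, the total depth drops from 6 to 5 on this table, in line with the strict inequality obtained in the proof.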

Lemma 2.4. Let $z$ be a problem over an information system, $P$ a probability distribution for $z$, and $D$ a proper decomposition of the pair $(z, P)$. Then there exists a decision tree for the problem $z$ that solves $z$ and is optimal for $z$ and $P$, in which each path is ordered by basic attributes.

Proof. Let $\Gamma$ be a decision tree for $z$ that solves $z$ and is optimal for $z$ and $P$. Denote by $W(\Gamma)$ the set of nodes in $\Gamma$ such that any node $w \in W(\Gamma)$ is assigned with a basic attribute and at least one node in the path $path(\Gamma, w)$ is assigned with an extended attribute. Obviously, if $W(\Gamma) = \emptyset$, then all paths in $\Gamma$ are ordered by basic attributes.

Let $W(\Gamma) \ne \emptyset$. Consider an arbitrary complete path $\xi$ in $\Gamma$ that is not ordered by basic attributes. Let $w$ be the first node in $\xi$ that is contained in $W(\Gamma)$. According to Lemma 2.3, $\Gamma$ is a reduced decision tree. Let us apply the path reduction operation to $\xi$ and denote the resulting tree by $\tilde\Gamma$. Lemma 2.2 implies that $\tilde\Gamma$ is a decision tree for $z$ that solves $z$ and is optimal for $z$ and $P$. One can see that $W(\tilde\Gamma) \subseteq W(\Gamma) \setminus \{w\}$. Let us apply the above-mentioned transformation to $\tilde\Gamma$ and repeat this procedure until $W(\tilde\Gamma) = \emptyset$. Thus the desired decision tree is obtained in a finite number of steps. □

Let $z$ be a problem over an information system, $P$ a probability distribution for $z$ and $((z_0, P_0), (z_1, P_1), \ldots, (z_m, P_m))$ a proper decomposition of the pair $(z, P)$. We will say that the tree $\Gamma$ is completely ordered if for each row $\bar\delta \in T_{z_0}$, there exists a node $v^{\bar\delta}$ in $\Gamma$ such that $T_{z_0}\pi(path(\Gamma, v^{\bar\delta})) = \{\bar\delta\}$, and all nodes in the path $path(\Gamma, v^{\bar\delta})$ are assigned with basic attributes with the possible exception of $v^{\bar\delta}$.

Lemma 2.5. Let $z$ be a problem over an information system, $P$ a probability distribution for $z$, and $D$ a proper decomposition of the pair $(z, P)$. Then there exists a decision tree for the problem $z$ that solves $z$, is optimal for $z$ and $P$, and is completely ordered.

Proof. Let $D = ((z_0, P_0), (z_1, P_1), \ldots, (z_m, P_m))$, and $z_i = (\nu_i, f^i_1, \ldots, f^i_{n_i})$ for $i = 0, \ldots, m$. Let $A_1, \ldots, A_m$ be the equivalence classes of the problem $z_0$, $T_{z_0} = \{\bar\delta^1, \ldots, \bar\delta^m\}$, and $\bar\delta^i = (f^0_1(a_i), \ldots, f^0_{n_0}(a_i))$, $a_i \in A_i$ for $i = 1, \ldots, m$.

According to Lemma 2.4 there exists a decision tree $\Gamma$ for the problem $z$ that solves $z$ and is optimal for $z$ and $P$, in which each complete path is ordered by basic attributes. We will say that rows $\bar\delta^i, \bar\delta^j \in T_{z_0}$, $i \ne j$, are separated in $\Gamma$ with basic attributes if there is a node $v \in \Gamma$ that is assigned with a basic attribute $f^0_r$, and for some number $\delta \in \{0, 1\}$, the relations $\bar\delta^i \in T_{z_0}\pi(path(\Gamma, v))(f^0_r, \delta)$ and $\bar\delta^j \in T_{z_0}\pi(path(\Gamma, v))(f^0_r, 1 - \delta)$ hold. Denote by $R(\Gamma)$ the number of unordered pairs of rows in the table $T_{z_0}$ which are not separated in $\Gamma$ with basic attributes. Obviously, if $R(\Gamma) = 0$, then $\Gamma$ is a completely ordered decision tree.

Let $R(\Gamma) \ne 0$. Let $\bar\delta^i = (\delta^i_1, \ldots, \delta^i_{n_0})$ and $\bar\delta^j = (\delta^j_1, \ldots, \delta^j_{n_0})$ be rows of $T_{z_0}$ which are not separated in $\Gamma$ with basic attributes. Choose a number $r \in \{1, \ldots, n_0\}$ such that $\delta^i_r \ne \delta^j_r$. Consider a row $\bar d = (d^0_1, \ldots, d^0_{n_0}, \ldots, d^m_1, \ldots, d^m_{n_m}) \in T_z$ for which $(d^0_1, \ldots, d^0_{n_0}) = \bar\delta^i$. Find a complete path $\varphi$ in the tree $\Gamma$ such that $\bar d \in T_z\pi(\varphi)$. The inequality (ii) implies that there exists a row $\bar c = (c^0_1, \ldots, c^0_{n_0}, \ldots, c^m_1, \ldots, c^m_{n_m}) \in T_z$ such that $(c^0_1, \ldots, c^0_{n_0}) = \bar\delta^i$ or $(c^0_1, \ldots, c^0_{n_0}) = \bar\delta^j$, but $\bar c \notin T_z\pi(\varphi)$. Then at least one node in the path $\varphi$ is assigned with an extended attribute. Denote by $v_1$ the first node of the path $\varphi$ that is assigned with an extended attribute, and denote by $G$ the subtree of $\Gamma$ whose root is $v_1$. Denote by $\xi$ a complete path in $G$ such that each edge is assigned with the number 0. Let $\xi = v_1, e_1, \ldots, v_t, e_t, v_{t+1}$ where $v_1, \ldots, v_t \in V(\Gamma)$ and $e_1, \ldots, e_t \in E(\Gamma)$. Consider two cases.

1) Let $T_z\pi(path(\Gamma, v_{t+1})) = T_z\pi(path(\Gamma, v_{t+1}))(f^0_r, \sigma)$ for some $\sigma \in \{0, 1\}$. Denote by $k$ the minimum number for which the relation $T_z\pi(path(\Gamma, v_k)) = T_z\pi(path(\Gamma, v_k))(f^0_r, \sigma)$ holds. Denote $\tilde e_{k-1} = e(\Gamma, v_{k-1}, 1)$, and denote by $w$ the node which the edge $\tilde e_{k-1}$ enters. Assign the attribute $f^0_r$ to the node $v_{k-1}$, the number $\sigma$ to the edge $e_{k-1}$ and the number $(1 - \sigma)$ to the edge $\tilde e_{k-1}$. Denote by $\Gamma'$ the resulting decision tree. In order to show that $\Gamma'$ solves $z$, it is sufficient to prove the equality $T_z\pi(path(\Gamma, w)) = T_z\pi(path(\Gamma, w))(f^0_r, 1 - \sigma)$. Since the node $v_1$ is assigned with an extended attribute and the path $\xi$ is ordered by basic attributes, the node $v_{k-1}$ is assigned with an extended attribute. Then $T_z\pi(path(\Gamma, w)) = T_z\pi(path(\Gamma, w))(f^0_r, \sigma_1)$ for some $\sigma_1 \in \{0, 1\}$. By the choice of the number $k$, the relation

$$T_z\pi(path(\Gamma, v_{k-1})) \ne T_z\pi(path(\Gamma, v_{k-1}))(f^0_r, \sigma)$$

holds. Therefore,

$$T_z\pi(path(\Gamma, w)) = T_z\pi(path(\Gamma, w))(f^0_r, 1 - \sigma),$$

and $\Gamma'$ solves the problem $z$. Obviously, $h(\Gamma', P) = h(\Gamma, P)$ and the tree $\Gamma'$ is optimal for $z$ and $P$. Denote by $\xi'$ the complete path in the tree $\Gamma'$ that ends in the node $v_{t+1}$. Let us apply the path reduction operation to $\xi'$ and denote the resulting tree by $\Gamma''$. From Lemma 2.2 it follows that $\Gamma''$ is a decision tree for the problem $z$ that solves $z$ and is optimal for $z$ and $P$. One can see that each complete path in $\Gamma''$ is ordered by basic attributes, and $R(\Gamma'') \le R(\Gamma) - 1$.

2) Let $T_z\pi(path(\Gamma, v_{t+1})) \ne T_z\pi(path(\Gamma, v_{t+1}))(f^0_r, \sigma)$ for $\sigma = 0, 1$. Then the inequalities (i) and (ii) from the definition of the proper problem decomposition imply that $t \ge 3$, and for some $k_1, k_2 \in \{1, \ldots, t\}$, the node $v_{k_1}$ is assigned with an attribute from the set $\{\tilde f^i_1, \ldots, \tilde f^i_{n_i}\}$, and the node $v_{k_2}$ is assigned with an attribute from the set $\{\tilde f^j_1, \ldots, \tilde f^j_{n_j}\}$. Denote by $a$ the number assigned to the node $v_{t+1}$. Assign the attribute $f^0_r$ to the node $v_{t+1}$, add two edges leaving this node and label them with the numbers 0 and 1 respectively. Add to the tree $\Gamma$ two nodes $w_0$ and $w_1$, assign the number $a$ to these nodes and transform the tree $\Gamma$ so that the edge $e(\Gamma, v_{t+1}, \sigma)$ enters the node $w_\sigma$ for $\sigma = 0, 1$. Denote the resulting tree by $\Gamma'$. One can see that $\Gamma'$ is a decision tree for the problem $z$ that solves $z$ and $h(\Gamma', P) = h(\Gamma, P) + N(T_z\pi(\xi), P)/N(T_z, P)$. Denote by $\xi'$ the complete path in the tree $\Gamma'$ that ends in $w_0$. One can see that $N_0(\xi') + N_1(\xi') = N(T_z\pi(\xi), P)$. Apply the path reduction operation to the path $\xi'$ and denote the resulting tree by $\Gamma''$. From Lemma 2.2 it follows that $\Gamma''$ is a decision tree for the problem $z$ that solves $z$, and $h(\Gamma'', P) \le h(\Gamma', P) - (N_0(\xi') + N_1(\xi'))/N(T_z, P)$. Then $h(\Gamma'', P) \le h(\Gamma, P)$ and the decision tree $\Gamma''$ is optimal for $z$ and $P$. One can see that each complete path in the tree $\Gamma''$ is ordered by basic attributes and $R(\Gamma'') \le R(\Gamma) - 1$.

Let us apply the above-mentioned transformation to $\Gamma''$ and repeat this procedure until $R(\Gamma'') = 0$. Thus we obtain the desired decision tree in a finite number of steps. □

Proof of Theorem 2.5. Let us show that the decision tree $\Phi = \Phi(\Gamma_0, \Gamma_1, \ldots, \Gamma_m)$ solves the problem $z$. Denote by $A_1, \ldots, A_m$ the equivalence classes of the problem $z_0$. Let $T_{z_0} = \{\bar\delta^1, \ldots, \bar\delta^m\}$ where $\bar\delta^i = (f^0_1(a_i), \ldots, f^0_{n_0}(a_i))$, $a_i \in A_i$, $i = 1, \ldots, m$. Consider an arbitrary row $\bar\delta = (\delta^0_1, \ldots, \delta^0_{n_0}, \ldots, \delta^m_1, \ldots, \delta^m_{n_m}) \in T_z$. Let $(\delta^0_1, \ldots, \delta^0_{n_0}) = \bar\delta^i$ for some $i \in \{1, \ldots, m\}$. Denote by $\xi^{\bar\delta}$ the complete path in the decision tree $\Phi$ such that $\bar\delta \in T_z\pi(\xi^{\bar\delta})$. From the definition of the tree $\Phi$ it follows that the terminal node of the path $\xi^{\bar\delta}$ belongs to the subtree $\tilde\Gamma_i$. Since the decision tree $\Gamma_i$ solves the problem $z_i$, the terminal node of the path $\xi^{\bar\delta}$ is assigned with the number $\nu_i(\delta^i_1, \ldots, \delta^i_{n_i})$. From the definition of proper decomposition we have $\nu(\bar\delta) = \nu_i(\delta^i_1, \ldots, \delta^i_{n_i})$. Therefore, $\Phi$ solves the problem $z$.

Let us show that $\Phi$ is optimal for $z$ and $P$. From Lemma 2.5 it follows that there exists a decision tree $G$ for the problem $z$ that solves $z$, is optimal for $z$ and $P$, and is completely ordered. For $i = 1, \ldots, m$, denote by $v_i$ the node of the tree $G$ such that $T_{z_0}\pi(path(G, v_i)) = \{\bar\delta^i\}$ and all nodes in the path $path(G, v_i)$ are assigned with basic attributes with the possible exception of the node $v_i$. For $i = 1, \ldots, m$, delete from the tree $G$ the subtree whose root is $v_i$, but leave $v_i$ itself. Assign to $v_i$ the number $\nu_0(\bar\delta^i)$. Denote the resulting tree by $G_0$. According to Lemma 2.3, $G$ is a reduced decision tree. It implies that all nonterminal nodes of the tree $G_0$ are encountered in the paths $path(G_0, v_1), \ldots, path(G_0, v_m)$. Then all nonterminal nodes in $G_0$ are assigned with basic attributes and $G_0$ is a decision tree for $z_0$. One can see that $G_0$ solves $z_0$. Then

$$h(G_0, P_0) \ge h(z_0, P_0). \tag{2.27}$$

Let $z_i = (\nu_i, f^i_1, \ldots, f^i_{n_i})$ for $i = 0, \ldots, m$, and $z = (\nu, f^0_1, \ldots, f^0_{n_0}, \tilde f^1_1, \ldots, \tilde f^1_{n_1}, \ldots, \tilde f^m_1, \ldots, \tilde f^m_{n_m})$.

For an arbitrary $i \in \{1, \ldots, m\}$, consider the subtree $G_i$ of the tree $G$ whose root is the node $v_i$. By definition, $\tilde f^j_k \equiv 0$ on the set $A_i$ for any $j \in \{1, \ldots, m\} \setminus \{i\}$, $k \in \{1, \ldots, n_j\}$. Then the fact that $G$ is a reduced decision tree implies that all nonterminal nodes of the tree $G_i$ are assigned with attributes from the set $\{\tilde f^i_1, \ldots, \tilde f^i_{n_i}\}$. For each nonterminal node $w$ in $G_i$, let us replace the attribute $\tilde f^i_j$ assigned to $w$ with the corresponding attribute $f^i_j$. Denote by $\tilde G_i$ the resulting tree. One can see that $\tilde G_i$ is a decision tree for $z_i$ that solves $z_i$. Then

$$h(\tilde G_i, P_i) \ge h(z_i, P_i). \tag{2.28}$$

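Let us compare the average depth of the trees $G$ and $\Phi$. One can see that

$$h(\Phi, P) = h(\Gamma_0, P_0) + \frac{1}{N(T_z, P)} \sum_{i=1}^{m} N(T_{z_i}, P_i)\, h(\Gamma_i, P_i)$$

and

$$h(G, P) = h(G_0, P_0) + \frac{1}{N(T_z, P)} \sum_{i=1}^{m} N(T_{z_i}, P_i)\, h(\tilde G_i, P_i).$$

Since $\Gamma_0, \Gamma_1, \ldots, \Gamma_m$ are optimal decision trees, the inequalities (2.27) and (2.28) result in $h(\Phi, P) \le h(G, P)$. Therefore, $\Phi$ is an optimal decision tree for $z$ and $P$. □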
2.4.3 Example of Decomposable Problem

Theorem 2.5 allows finding decision trees with the minimum average depth for some classes of problems. This section shows that the upper bound on the average depth of a decision tree given by Theorem 2.4 is close to unimprovable.

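Theorem 2.6. For arbitrary natural numbers $m \ge 2$, $n$, there exist a 2-valued information system $U^n_m$, a problem $z^n_m$ over $U^n_m$ with $m^n$ equivalence classes and a probability distribution $P^n_m \equiv 1$ such that $H(P^n_m) = n \log_2 m$,

$$M(z^n_m) = \begin{cases} m - 1, & \text{if } n = 1, \\ m, & \text{if } n = 2, \\ m + 1, & \text{if } n \ge 3, \end{cases} \quad \text{and} \quad h(z^n_m, P^n_m) = \frac{(m + 2)(m - 1)}{2m}\, n.$$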

We preface the proof of the theorem by several auxiliary definitions. Let $m \ge 2$, $n$ be arbitrary natural numbers. Define a system of circles $B^n_m$ in a plane. By definition, $B^1_m$ consists of $m$ non-intersecting circles such that none is enclosed in another. Let the system $B^{n-1}_m$ have been already defined. Then the system $B^n_m$ consists of $m$ non-intersecting circles, such that none is enclosed in another, and there is a system of the kind $B^{n-1}_m$ inside each circle. A circle in $B^n_m$ is called a zero order circle if it does not contain any circles from $B^n_m$. Let for some $i < n$, circles of orders from zero to $(i - 1)$ have been already defined. A circle from $B^n_m$ is called an $i$-th order circle if the order of all enclosed circles is at most $(i - 1)$ and is equal to $(i - 1)$ for at least one circle. One can see that $B^n_m$ contains $s = m^n$ zero order circles. Denote these circles by $C_1, \ldots, C_s$. For $i = 1, \ldots, s$, denote by $a_i$ a point inside $C_i$, and denote $A = \{a_1, \ldots, a_s\}$. Set into correspondence to each circle $C$ from $B^n_m$ a function $f : A \to \{0, 1\}$. The function $f$ takes the value 1 on an element $a_i$ if the point $a_i$ is located inside the circle $C$, and 0 otherwise. Denote by $F = \{f_1, \ldots, f_t\}$ the set of functions that correspond to all circles from $B^n_m$. Then $U^n_m = (A, F)$.
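
The nested system of circles can be modeled compactly in code. In the sketch below, which is an illustration only, each circle is encoded by the tuple of indices chosen on the way down to it, so a tuple of length $k$ stands for a circle of order $n - k$ and the full-length tuples play the role of the points $a_i$. This encoding and the name `circle_system` are assumptions made here, not notation from the text.

```python
from itertools import product

def circle_system(m, n):
    """Return (objects, circles, table) for the nested system of circles.

    objects: the m**n zero order circles (one point is placed in each).
    circles: all circles, encoded as index tuples of length 1..n.
    table[c][i]: value of the attribute of circle c on object number i.
    """
    objects = list(product(range(m), repeat=n))
    circles = [c for k in range(1, n + 1)
               for c in product(range(m), repeat=k)]
    # A point lies inside circle c exactly when c is a prefix of its code.
    table = {c: [int(o[:len(c)] == c) for o in objects] for c in circles}
    return objects, circles, table

objects, circles, table = circle_system(3, 2)
orders = [2 - len(c) for c in circles]  # order of each circle for n = 2
```

For $m = 3$ and $n = 2$ this gives nine points, three first order circles containing three points each, and nine zero order circles containing one point each.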

Let $z^n_m = (\nu, f_1, \ldots, f_t)$ be a diagnostic problem over $U^n_m$. The following lemma gives the value of the parameter $M(z)$ for the problem $z^n_m$.

Lemma 2.6. Let $m \ge 2$, $n$ be arbitrary natural numbers. Then

$$M(z^n_m) = \begin{cases} m - 1, & \text{if } n = 1, \\ m, & \text{if } n = 2, \\ m + 1, & \text{if } n \ge 3. \end{cases}$$
Proof. Consider the case $n \ge 3$. Let us calculate $M(z^n_m, \bar\delta)$ for an arbitrary tuple $\bar\delta \in \{0, 1\}^t$. Let $\bar\delta = (0, \ldots, 0)$. The system of equations $\{f_1(x) = 0, \ldots, f_m(x) = 0\}$ does not have a solution on the set $A$ if $f_1, \ldots, f_m$ are pairwise different attributes corresponding to $(n - 1)$-th order circles. Therefore, $M(z^n_m, \bar\delta) \le m$. Let $\bar\delta \ne (0, \ldots, 0)$. Denote by $C_0$ a circle of the smallest order such that its corresponding attribute takes the value 1 on $\bar\delta$. Denote by $n_0$ the order of the circle $C_0$ and by $f_{i_0}$ its corresponding attribute. If $n_0 = 0$, then the equation $f_{i_0}(x) = 1$ has a single solution on the set $A$, and $M(z^n_m, \bar\delta) = 1$. Let $n_0 \ge 1$. Denote by $f_{i_1}, \ldots, f_{i_m}$ the attributes corresponding to the $(n_0 - 1)$-th order circles that are enclosed into $C_0$. By the choice of $C_0$, the attributes $f_{i_1}, \ldots, f_{i_m}$ take the value 0 on $\bar\delta$. The system of equations $\{f_{i_0}(x) = 1, f_{i_1}(x) = 0, \ldots, f_{i_m}(x) = 0\}$ does not have a solution on the set $A$, so $M(z^n_m, \bar\delta) \le m + 1$. Therefore,

$$M(z^n_m) \le m + 1. \tag{2.29}$$

Let $C_2$ be an arbitrary second order circle. Consider a tuple $\bar\delta = (\delta_1, \ldots, \delta_t)$ in which the values of the attributes corresponding to $C_2$ and all circles that contain $C_2$ are set to 1, and all other elements are set to 0. Let us show that $M(z^n_m, \bar\delta) \ge m + 1$. Denote by $f_{i_0}$ the attribute corresponding to $C_2$, and by $f_{i_1}, \ldots, f_{i_m}$ the attributes corresponding to the first order circles enclosed into $C_2$. Let $S = \{f_{j_1}(x) = \delta_{j_1}, \ldots, f_{j_k}(x) = \delta_{j_k}\}$ be an arbitrary system of equations that either does not have a solution on $A$ or satisfies $z^n_m(x) \equiv const$ on the set of solutions. Let $l \in \{1, \ldots, k\}$. Replace the equation $f_{j_l}(x) = \delta_{j_l}$ with the equation $f_{i_r}(x) = 0$ if the circle corresponding to the attribute $f_{j_l}$ is either enclosed into the circle corresponding to $f_{i_r}$ or coincides with it for some $r \in \{1, \ldots, m\}$. Otherwise, replace the equation $f_{j_l}(x) = \delta_{j_l}$ with the equation $f_{i_0}(x) = 1$. Let us make such a replacement for $l = 1, \ldots, k$, and denote the resulting system by $S_1$. One can see that the system $S_1$ contains at most $k$ equations, the set of solutions of $S_1$ on $A$ is a subset of the set of solutions of $S$ on $A$, and $S_1$ is a subsystem of the system $S_2 = \{f_{i_0}(x) = 1, f_{i_1}(x) = 0, \ldots, f_{i_m}(x) = 0\}$. Assume that $S_1 \ne S_2$. One can see that in this case $z^n_m(x) \not\equiv const$ on the set of solutions of $S_1$ on $A$. But this is impossible. Therefore, $k \ge m + 1$. Then any system of equations that either does not have a solution on $A$ or has a set of solutions coinciding with some equivalence class contains at least $(m + 1)$ equations, and $M(z^n_m, \bar\delta) \ge m + 1$. This inequality and (2.29) imply $M(z^n_m) = m + 1$. The cases $n = 1$ and $n = 2$ are considered similarly. □



Proof of Theorem 2.6. We apply induction on $n$. Let $n = 1$. Define a decision tree $\Gamma$ for the problem $z^1_m$. The decision tree $\Gamma$ contains $(m - 1)$ nonterminal nodes $v_1, \ldots, v_{m-1}$, which are assigned with pairwise different attributes $f_1, \ldots, f_{m-1}$, and $m$ terminal nodes $v_m, w_1, \ldots, w_{m-1}$. For $i = 1, \ldots, m - 1$, two edges leave the node $v_i$ that are labeled with the numbers 0 and 1 respectively. The edge labeled with 0 enters the node $v_{i+1}$ and the edge labeled with 1 enters the node $w_i$. For $i = 1, \ldots, m - 1$, the node $w_i$ is assigned with the number $z^1_m(a_i)$ where $a_i$ is an element of the set $A$ such that $f_i(a_i) = 1$. The node $v_m$ is assigned with the number $z^1_m(a_0)$ where $a_0$ is an element of the set $A$ such that $f_i(a_0) = 0$ for $i = 1, \ldots, m - 1$. The decision tree $\Gamma$ does not contain any other nodes and edges. One can see that $\Gamma$ solves the problem $z^1_m$ and is optimal for $z^1_m$ and $P^1_m$. Let us calculate the average depth of $\Gamma$:

$$h(\Gamma, P^1_m) = \frac{1}{m} \left( \sum_{i=1}^{m-1} i + (m - 1) \right) = \frac{(m + 2)(m - 1)}{2m}.$$

Then $h(z^1_m, P^1_m) = (m + 2)(m - 1)/(2m)$.

Let $n$ be a natural number greater than 1 such that the theorem holds for all natural numbers less than $n$. Denote by $C_1, \ldots, C_m$ the $(n - 1)$-th order circles contained in the system $B^n_m$, and by $f_1, \ldots, f_m$ their corresponding attributes. Consider a decomposition $((z_0, P_0), (z_1, P_1), \ldots, (z_m, P_m))$ of the pair $(z^n_m, P^n_m)$. The diagnostic problem $z_0$ contains only the attributes $f_1, \ldots, f_m$. For $i = 1, \ldots, m$, the problem $z_i$ contains all attributes corresponding to the circles enclosed in $C_i$, and $z_i(x)$ is the mapping $z^n_m : A \to \omega$ restricted to the set $\{a \in A : f_i(a) = 1\}$. For $i = 0, \ldots, m$, let $P_i$ be a uniform probability distribution for the problem $z_i$. One can see that $((z_0, P_0), (z_1, P_1), \ldots, (z_m, P_m))$ is a proper decomposition of the pair $(z^n_m, P^n_m)$, $h(z_0, P_0) = h(z^1_m, P^1_m)$ and $h(z_i, P_i) = h(z^{n-1}_m, P^{n-1}_m)$ for $i = 1, \ldots, m$. Using the induction hypothesis, we obtain $h(z_0, P_0) = (m + 2)(m - 1)/(2m)$ and $h(z_i, P_i) = [(m + 2)(m - 1)/(2m)](n - 1)$ for $i = 1, \ldots, m$. Let $\Gamma_i$ be a decision tree for the problem $z_i$ that solves $z_i$ and is optimal for $z_i$ and $P_i$, $i = 0, \ldots, m$. Let $\Phi = \Phi(\Gamma_0, \Gamma_1, \ldots, \Gamma_m)$. From the definition of the tree $\Phi$ it follows that

$$h(\Phi, P^n_m) = h(\Gamma_0, P_0) + \sum_{i=1}^{m} \frac{N(T_{z_i}, P_i)\, h(\Gamma_i, P_i)}{N(T_{z^n_m}, P^n_m)} = \frac{(m + 2)(m - 1)}{2m}\, n.$$

Using Theorem 2.5 we obtain $h(z^n_m, P^n_m) = [(m + 2)(m - 1)/(2m)]\, n$. □
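
For small $m$ and $n$ the value stated in the theorem can be cross-checked by exhaustive search. The sketch below is an illustration, not part of the proof: it encodes circles as index prefixes (an assumption made here), treats every point as its own class of the diagnostic problem under the uniform distribution, and computes the minimum total depth by dynamic programming over subsets of points. The name `min_average_depth` is hypothetical, and the search is feasible only for very small inputs.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product

def min_average_depth(m, n):
    """Minimum average depth of a decision tree that separates all m**n
    points of the nested circle system, under the uniform distribution."""
    objects = list(product(range(m), repeat=n))
    circles = [c for k in range(1, n + 1)
               for c in product(range(m), repeat=k)]
    # attrs[c][i] = 1 if point i lies inside circle c (prefix test)
    attrs = [tuple(int(o[:len(c)] == c) for o in objects) for c in circles]

    @lru_cache(maxsize=None)
    def total_depth(subset):
        """Minimum sum of depths over the points in subset."""
        if len(subset) <= 1:
            return 0
        best = None
        for a in attrs:
            ones = frozenset(i for i in subset if a[i])
            zeros = subset - ones
            if not ones or not zeros:
                continue  # this attribute does not split the subset
            cost = len(subset) + total_depth(ones) + total_depth(zeros)
            if best is None or cost < best:
                best = cost
        return best

    everything = frozenset(range(len(objects)))
    return Fraction(total_depth(everything), len(objects))

def formula(m, n):
    """Value given by the theorem: ((m + 2)(m - 1) / (2m)) * n."""
    return Fraction((m + 2) * (m - 1) * n, 2 * m)
```

For instance, `min_average_depth(3, 2)` and `formula(3, 2)` both evaluate to 10/3.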



Chapter 3 

Representing Boolean Functions by Decision Trees



A Boolean or discrete function can be represented by a decision tree. A compact form of decision tree, known as a binary decision diagram or branching program, is widely used in logic design [2 HD]. This representation is equivalent to other forms, and in some cases it is more compact than the table of values or even the formula [44]. Representing a function in the form of a decision tree allows applying graph algorithms for various transformations [TU]. Decision trees and branching programs are used for effective hardware [15] and software [5] implementation of functions. For the implementation to be effective, the function representation should have minimal time and space complexity. The average depth of a decision tree characterizes the expected computing time, and the number of nodes in a branching program characterizes the number of functional elements required for implementation. Often these two criteria are incompatible, i.e. there is no solution that is optimal with respect to both time and space complexity.

This chapter considers several problems of representing functions in the form of decision trees. It consists of two sections. The first section studies the average time complexity of representing Boolean functions by decision trees. The complexity of a class of functions can be characterized by a Shannon type function $\mathcal{H}_B(n)$ that shows the dependence of the minimum average depth of a decision tree in the worst case on the number of function arguments. For each closed class of Boolean functions $B$, a lower and an upper bound on $\mathcal{H}_B(n)$ are given. Analogous results for the depth of decision trees are described in [48]. The second section considers branching programs with the minimum average weighted depth. It is proven that such programs are read-once, i.e. each attribute is checked at most once along each path. This fact implies high lower bounds on the number of nodes in branching programs with the minimum average weighted depth for several known functions.

Results of this chapter were previously published in [19].
3.1 On Average Depth of Decision Trees Implementing Boolean Functions

This section contains some auxiliary notions followed by propositions that give bounds on the function $\mathcal{H}_B(n)$ for all closed classes of Boolean functions. The notation of closed classes of Boolean functions is in accordance with [36]; the classes and the class inclusion diagram are described in Appendix A.

3.1.1 Auxiliary Notions 

A function of the form $f : E_2^n \to E_2$ is called a Boolean function. The constants 0 and 1 are also Boolean functions.

Let $f(x_1, \ldots, x_n)$ be a Boolean function. A variable $x_i$ of the function $f$ will be called essential if there exist two $n$-tuples $\bar\delta$ and $\bar\sigma$ from $E_2^n$ which differ only in the $i$-th digit and for which $f(\bar\delta) \ne f(\bar\sigma)$. Variables of the function $f$ which are not essential will be called inessential.

Let us set into correspondence to a Boolean function $f(x_1, \ldots, x_n)$ a problem $z = (f, x_1, \ldots, x_n)$ over the information system $U_n = (E_2^n, \{x_1, \ldots, x_n\})$. The problem $z$ has two equivalence classes $Q_0$ and $Q_1$ containing the sets of binary tuples on which $f$ takes the value 0 and 1 respectively. A decision tree solving the problem $z$ is called a decision tree implementing $f$. Denote by $g(f)$ and $h(f)$ respectively the minimum depth of a decision tree implementing $f$ and the minimum average depth of a decision tree implementing $f$ relative to the probability distribution $P \equiv 1$.

Denote dim / the number of arguments of the function /. Let _B be a set 
of Boolean functions. Consider the functions 

GBin) = max{g(/) : f £ B, dim f < n} 

and 

nsin) = max{/i(/) : / € B,dim/ < n} 

that characterize the growth in the worst case of the minimum depth and 
the minimum average depth of decision trees implementing Boolean functions 
from B with growth of the number of function arguments. Note that Tisin) < 
Gsin) for any n. 
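
For small $n$, the quantities $g(f)$ and $h(f)$ can be computed by brute force. The following Python sketch is only an illustration and is not part of the original text: it assumes that a function is given by its truth table and uses the recursion "choose a root attribute, then solve both restrictions optimally". The helper names `truth_table`, `restrict`, `g` and `h` are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product


def truth_table(f, n):
    """Values of f on E_2^n, rows taken in lexicographic order."""
    return tuple(f(*bits) for bits in product((0, 1), repeat=n))


def restrict(table, i, b):
    """Truth table of the function with the i-th variable (0-based) fixed to b."""
    n = len(table).bit_length() - 1
    rows = product((0, 1), repeat=n)
    return tuple(v for bits, v in zip(rows, table) if bits[i] == b)


@lru_cache(maxsize=None)
def g(table):
    """Minimum depth of a decision tree implementing the function."""
    if len(set(table)) == 1:          # constant function: a single terminal node
        return 0
    n = len(table).bit_length() - 1
    return 1 + min(max(g(restrict(table, i, 0)), g(restrict(table, i, 1)))
                   for i in range(n))


@lru_cache(maxsize=None)
def h(table):
    """Minimum average depth relative to the uniform distribution."""
    if len(set(table)) == 1:
        return Fraction(0)
    n = len(table).bit_length() - 1
    return 1 + min((h(restrict(table, i, 0)) + h(restrict(table, i, 1))) / 2
                   for i in range(n))


parity3 = truth_table(lambda a, b, c: a ^ b ^ c, 3)
and3 = truth_table(lambda a, b, c: a & b & c, 3)
print(g(parity3), h(parity3))
print(g(and3), h(and3))
```

The sketch uses exact rational arithmetic, so the computed values can be compared directly with the closed-form bounds of this section.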

3.1.2 Bounds on Function $\mathcal{H}_B(n)$

In this section, a number of propositions are formulated that, for each closed
class of Boolean functions $B$, give an upper and a lower bound on
$\mathcal{H}_B(n)$, followed by two theorems that compare the values
$\mathcal{G}_B(n)$ and $\mathcal{H}_B(n)$.

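Proposition 3.1. For $B \in \{O_2, O_3, O_7\}$, the relation
$\mathcal{H}_B(n) = 0$ holds.

Proposition 3.2. For $B \in \{O_1, O_4, O_5, O_6, O_8, O_9\}$, the relation
$\mathcal{H}_B(n) = 1$ holds.

Proposition 3.3. For $B \in \{S_1, S_3, S_5, S_6, P_1, P_3, P_5, P_6\}$, the
relation

$$\mathcal{H}_B(n) = \begin{cases} 2 - 2^{1-n}, & \text{if } n \ge 2, \\ 1, & \text{if } n = 1 \end{cases}$$

holds.

Proposition 3.4. For $B \in \{L_1, L_2, L_3, C_1, C_2, C_3\}$, the relation
$\mathcal{H}_B(n) = n$ holds.

Proposition 3.5. For $B \in \{L_4, L_5\}$, the relation

$$\mathcal{H}_B(n) = \begin{cases} n, & \text{if } n = 2k + 1,\ k \ge 0, \\ n - 1, & \text{if } n = 2k,\ k \ge 1 \end{cases}$$

holds.

Proposition 3.6. For $B = C_4$, the relation

$$\mathcal{H}_B(n) = \begin{cases} n, & \text{if } n = 2k + 1,\ k \ge 0, \\ n - \frac{1}{2^{n-1}}, & \text{if } n = 2k,\ k \ge 1 \end{cases}$$

holds.

Proposition 3.7. For $B \in \{D_1, D_3\}$, the following relations hold:
a) $\mathcal{H}_B(n) = n$, if $n = 2k + 1$, $k \ge 0$;
b) $n - 1.7/\sqrt{n} \le \mathcal{H}_B(n) \le n - 1/2^{n-1}$, if $n = 2k$, $k \ge 1$.

Proposition 3.8. For $B \in \{M_1, M_2, M_3, M_4\}$, the relation

$$n + 1 - \sqrt{n + 1} \le \mathcal{H}_B(n) \le n - \lfloor n/2 \rfloor\, 2^{-\lfloor n/2 \rfloor}$$

holds.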
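
Proposition 3.9. For $B = D_2$, the relation

$$n + 1/2 - \sqrt{n + 1} \le \mathcal{H}_B(n) \le n - \lfloor n/2 \rfloor\, 2^{-\lfloor n/2 \rfloor}$$

holds.

Proposition 3.10. For $B \in \{F_1^\infty, F_4^\infty, F_5^\infty, F_8^\infty\}$,
the relation $\mathcal{H}_B(n) = (n + 1)/2$ holds.

Proposition 3.11. For $B \in \{F_2^\infty, F_3^\infty, F_6^\infty, F_7^\infty\}$,
the relation

$$1 + (n - \sqrt{n})/2 \le \mathcal{H}_B(n) \le (n + 1)/2, \quad n \ge 1$$

holds.

Proposition 3.12. For $B \in \{F_1^\mu, F_4^\mu, F_5^\mu, F_8^\mu\}$,
$\mu \ge 2$, the relation

$$(n + 1)/2 \le \mathcal{H}_B(n) \le n - \lfloor n/2 \rfloor\, 2^{-\lfloor n/2 \rfloor}$$

holds.

Proposition 3.13. For any $B \in \{F_2^\mu, F_3^\mu, F_6^\mu, F_7^\mu\}$,
$\mu \ge 2$, the relation

$$1 + (n - \sqrt{n})/2 \le \mathcal{H}_B(n) \le n - \lfloor n/2 \rfloor\, 2^{-\lfloor n/2 \rfloor}$$

holds.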

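The following theorem was proved by Moshkov.

Theorem 3.1 ([48]). Let $B$ be a closed class of Boolean functions and $n$ be
a natural number. Then

a) if $B \in \{O_2, O_3, O_7\}$, then $\mathcal{G}_B(n) = 0$;

b) if $B \in \{O_1, O_4, O_5, O_6, O_8, O_9\}$, then $\mathcal{G}_B(n) = 1$;

c) if $B \in \{L_4, L_5\}$, then
$\mathcal{G}_B(n) = \begin{cases} n, & \text{if } n \text{ is odd}, \\ n - 1, & \text{if } n \text{ is even}; \end{cases}$

d) if $B \in \{D_1, D_2, D_3\}$, then
$\mathcal{G}_B(n) = \begin{cases} n, & \text{if } n \ge 3, \\ 1, & \text{if } n \le 2; \end{cases}$

e) if the class $B$ does not coincide with any of the classes mentioned in
(a) - (d), then $\mathcal{G}_B(n) = n$.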

The following two theorems immediately follow from the previous theorem
and Propositions 3.1-3.13. These theorems characterize the relation between
$\mathcal{H}_B(n)$ and $\mathcal{G}_B(n)$ for each closed class of Boolean
functions.

Theorem 3.2. Let $B$ be a closed class of Boolean functions, and $n$ a natural
number. Let at least one of the following conditions hold:

a) $n = 1$;

b) $B \in \{O_1, \ldots, O_9, L_1, \ldots, L_5, C_1, C_2, C_3\}$;

c) $B \in \{C_4, D_1, D_3\}$ and $n$ is odd;

d) $B \in \{D_1, D_2, D_3\}$ and $n = 2$.

Then $\mathcal{H}_B(n) = \mathcal{G}_B(n)$. If none of the conditions (a), (b),
(c), (d) hold, then $\mathcal{H}_B(n) < \mathcal{G}_B(n)$.

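Theorem 3.3. Let $B$ be a closed class of Boolean functions. Then

a) $\lim_{n \to \infty} \mathcal{H}_B(n)/\mathcal{G}_B(n) = 0$ if
$B \in \{S_1, S_3, S_5, S_6, P_1, P_3, P_5, P_6\}$;
b) $\mathcal{H}_B(n)/\mathcal{G}_B(n) = 1$ if
$B \in \{O_1, \ldots, O_9, L_1, \ldots, L_5, C_1, C_2, C_3\}$;
c) $\lim_{n \to \infty} \mathcal{H}_B(n)/\mathcal{G}_B(n) = 1$ if
$B \in \{C_4, M_1, \ldots, M_4, D_1, D_2, D_3\}$;
d) $\lim_{n \to \infty} \mathcal{H}_B(n)/\mathcal{G}_B(n) = 1/2$ if
$B \in \{F_1^\infty, \ldots, F_8^\infty\}$;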

e) 

1 HBin) . 

2 yB(n) 

where $\varepsilon(n) = O(1/\sqrt{n})$, if $B \in \{F_1^\mu, \ldots, F_8^\mu\}$ and $\mu \ge 2$.



3.1.3 Proofs of Propositions 3.1-3.13

We preface the proofs of the propositions by a series of lemmas. Since in this
section the uniform probability distribution is assumed, it is omitted in the
notation, so the average depth of a tree $\Gamma$ is denoted by $h(\Gamma)$.

For an arbitrary path $\xi$ in a decision tree $\Gamma$, denote its length by
$l_\Gamma(\xi)$. Denote the logical negation operation by $\neg$ and the
modulo 2 summation by $\oplus$. A Boolean function $f(x_1, \ldots, x_n)$ is
called symmetrical if for each tuple $\bar\delta \in E_2^n$ and each
permutation $p$ of $n$ elements, the relation
$f(\bar\delta) = f(p(\bar\delta))$ holds.

Lemma 3.1. Let $f_0(x_1, \ldots, x_n)$ and $f_1(x_1, \ldots, x_n)$ be arbitrary
Boolean functions for some natural number $n$. Let $\Gamma_0$ and $\Gamma_1$ be
decision trees implementing $f_0$ and $f_1$ respectively. Let $\Gamma$ be a
decision tree of the following form:

a) the root of $\Gamma$ is assigned the attribute $x_{n+1}$;

b) for $\delta = 0, 1$, an edge $e_\delta$ labeled with the number $\delta$
leaves the root of $\Gamma$ and enters the root of $\Gamma_\delta$;

c) $\Gamma$ does not contain any other nodes and edges.

Then the following statements are true:

a) the decision tree $\Gamma$ implements the function
$\neg x_{n+1} \wedge f_0 \vee x_{n+1} \wedge f_1$;

b) $h(\Gamma) = 1 + (h(\Gamma_0) + h(\Gamma_1))/2$.

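Proof. Consider an arbitrary tuple
$\bar\delta = (\delta_1, \ldots, \delta_{n+1})$ and find in the tree $\Gamma$
the path $\xi(\bar\delta)$ on which computations for $\bar\delta$ are
performed. Since the root of $\Gamma$ is assigned the attribute $x_{n+1}$, the
terminal node of the path $\xi(\bar\delta)$ is located in the tree
$\Gamma_{\delta_{n+1}}$. Denote $\bar\delta^* = (\delta_1, \ldots, \delta_n)$
and denote by $\xi^{\delta_{n+1}}(\bar\delta^*)$ the part of
$\xi(\bar\delta)$ from the root of the tree $\Gamma_{\delta_{n+1}}$ to the
terminal node. One can see that in the tree $\Gamma_{\delta_{n+1}}$,
computations for the tuple $\bar\delta^*$ are performed along the path
$\xi^{\delta_{n+1}}(\bar\delta^*)$. Since the tree $\Gamma_{\delta_{n+1}}$
implements the function $f_{\delta_{n+1}}$, the terminal node of the path
$\xi(\bar\delta)$ is assigned the number
$f_{\delta_{n+1}}(\delta_1, \ldots, \delta_n) = \neg\delta_{n+1} \wedge f_0(\delta_1, \ldots, \delta_n) \vee \delta_{n+1} \wedge f_1(\delta_1, \ldots, \delta_n)$.
Therefore, the tree $\Gamma$ implements the function
$\neg x_{n+1} \wedge f_0 \vee x_{n+1} \wedge f_1$.

Obviously, the length of the path $\xi(\bar\delta)$ is greater by 1 than the
length of the path $\xi^{\delta_{n+1}}(\bar\delta^*)$. Then

$$h(\Gamma) = \frac{1}{2}\bigl(h(\Gamma_0) + 1 + h(\Gamma_1) + 1\bigr) = 1 + \frac{h(\Gamma_0) + h(\Gamma_1)}{2}.$$

□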
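
The composition described in Lemma 3.1 can be tried out on a small example. The Python sketch below is illustrative only: the tuple encoding of trees and the names `evaluate`, `average_depth` and `compose` are assumptions made here, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# A tree is either a terminal value (0 or 1) or a tuple
# (attribute_index, subtree_for_0, subtree_for_1); indices are 0-based.


def evaluate(tree, bits):
    """Return (result, path length) of the computation for the tuple bits."""
    depth = 0
    while isinstance(tree, tuple):
        attr, low, high = tree
        tree = high if bits[attr] else low
        depth += 1
    return tree, depth


def average_depth(tree, n):
    """Average path length over all 2^n tuples (uniform distribution)."""
    total = sum(evaluate(tree, bits)[1] for bits in product((0, 1), repeat=n))
    return Fraction(total, 2 ** n)


def compose(tree0, tree1, n):
    """Root checks x_{n+1} (index n); edge 0 enters tree0, edge 1 enters tree1."""
    return (n, tree0, tree1)


gamma0 = (0, 0, (1, 0, 1))            # implements f0 = x1 AND x2
gamma1 = (0, (1, 0, 1), (1, 1, 0))    # implements f1 = x1 XOR x2
gamma = compose(gamma0, gamma1, 2)

print(average_depth(gamma0, 2), average_depth(gamma1, 2), average_depth(gamma, 3))
```

For this example both statements of the lemma can be checked by enumerating all eight tuples.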



The following lemma gives a combinatorial identity, which will be used
further. Denote $g(n, t) = \sum_{i=t}^{n} C_i^t / 2^i$.

Lemma 3.2. For arbitrary natural numbers $n$, $t \le n$, the equality

$$g(n, t) = 2 - \frac{1}{2^n} \sum_{i=0}^{t} C_{n+1}^i$$

holds.

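Proof. Apply the following transformations, using the identity
$C_{i+1}^t = C_i^t + C_i^{t-1}$:

$$g(n, t) = \sum_{i=t}^{n+1} \frac{C_i^t}{2^i} - \frac{C_{n+1}^t}{2^{n+1}}
= \frac{1}{2}\left(\sum_{i=t-1}^{n} \frac{C_i^t}{2^i} + \sum_{i=t-1}^{n} \frac{C_i^{t-1}}{2^i}\right) - \frac{C_{n+1}^t}{2^{n+1}}
= \frac{1}{2}\left(g(n, t) + g(n, t-1) - \frac{1}{2^n} C_{n+1}^t\right).$$

Then we have

$$g(n, t) = g(n, t-1) - \frac{1}{2^n} C_{n+1}^t. \qquad (3.1)$$

Let us modify $g(n, 1)$ as follows:

$$g(n, 1) = \sum_{i=1}^{n} \frac{i}{2^i} = \frac{1}{2} + \frac{2}{4} + \ldots + \frac{n-1}{2^{n-1}} + \frac{n}{2^n}
= \sum_{i=1}^{n} \left(\frac{1}{2^{i-1}} - \frac{1}{2^n}\right)
= \frac{1 - \frac{1}{2^n}}{1 - \frac{1}{2}} - \frac{n}{2^n}
= 2 - \frac{1}{2^n} - \frac{n+1}{2^n}.$$

Then $g(n, 1)$ can be expressed in the following form:

$$g(n, 1) = 2 - \frac{1}{2^n}\left(C_{n+1}^0 + C_{n+1}^1\right). \qquad (3.2)$$

The equalities (3.1) and (3.2) imply
$g(n, t) = 2 - \frac{1}{2^n} \sum_{i=0}^{t} C_{n+1}^i$. □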
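
The identity of Lemma 3.2 is easy to check numerically for small arguments. The following Python sketch is an illustration under the definition $g(n, t) = \sum_{i=t}^{n} C_i^t/2^i$ used above; the function names `g` and `closed_form` are chosen here for convenience.

```python
from fractions import Fraction
from math import comb


def g(n, t):
    """g(n, t) as the sum of C(i, t) / 2^i over i = t, ..., n."""
    return sum(Fraction(comb(i, t), 2 ** i) for i in range(t, n + 1))


def closed_form(n, t):
    """Right-hand side of Lemma 3.2: 2 - 2^(-n) * sum of C(n+1, i), i = 0..t."""
    return 2 - Fraction(sum(comb(n + 1, i) for i in range(t + 1)), 2 ** n)


# Compare both sides exactly for all 1 <= t <= n <= 12.
mismatches = [(n, t) for n in range(1, 13) for t in range(1, n + 1)
              if g(n, t) != closed_form(n, t)]
print(mismatches)
```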

Lemma 3.3. The function

$$f(n) = (n + 1) - \frac{\sqrt{2}\,(n + 1)}{\sqrt{3n + 5}} - \left(n + \frac{3}{2} - \sqrt{n + 2}\right)$$

takes positive values for any natural number $n \ge 3$.

Proof. Convert all terms to the common denominator:

$$f(n) = \sqrt{n + 2} - \frac{1}{2} - \frac{\sqrt{2}\,(n + 1)}{\sqrt{3n + 5}}
= \frac{2\sqrt{n + 2}\,\sqrt{3n + 5} - \sqrt{3n + 5} - 2\sqrt{2}\,(n + 1)}{2\sqrt{3n + 5}}
= \frac{\varphi(n)}{2\sqrt{3n + 5}}.$$

The denominator is always positive, so $f(n)$ has the same sign as
$\varphi(n)$. Apply the following transformations:

$$\varphi(n) = (4n + 7) - \left(\sqrt{3n + 5} - \sqrt{n + 2}\right)^2 - \sqrt{3n + 5} - 2\sqrt{2}\,(n + 1)$$
$$> (4n + 7) - \left(\sqrt{3n + 6} - \sqrt{n + 2}\right)^2 - \sqrt{3n + 5} - 2\sqrt{2}\,(n + 1)$$
$$= (4n + 7) - (\sqrt{3} - 1)^2 (n + 2) - \sqrt{3n + 5} - 2\sqrt{2}\,(n + 1)$$
$$= 2(\sqrt{3} - \sqrt{2})\,n + 4\sqrt{3} - 2\sqrt{2} - 1 - \sqrt{3n + 5} > 0.6n + 3 - \sqrt{3n + 5} = \psi(n).$$

The facts that $\psi(n)$ is monotonically increasing and $\psi(3) > 0$ prove the
lemma. □
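
As a sanity check (not a proof), the sign of the function from Lemma 3.3 can be inspected numerically. The sketch below evaluates it in floating point, assuming the form of $f(n)$ reconstructed from the proof, namely $(n+1)(1 - \sqrt{2}/\sqrt{3n+5}) - (n + 3/2 - \sqrt{n+2})$; the name `psi` refers to the auxiliary function $0.6n + 3 - \sqrt{3n+5}$ from the last step of the proof.

```python
from math import sqrt


def f(n):
    """(n + 1)(1 - sqrt(2)/sqrt(3n + 5)) - (n + 3/2 - sqrt(n + 2))."""
    return (n + 1) * (1 - sqrt(2) / sqrt(3 * n + 5)) - (n + 1.5 - sqrt(n + 2))


def psi(n):
    """Auxiliary lower bound used in the last step of the proof."""
    return 0.6 * n + 3 - sqrt(3 * n + 5)


smallest = min(f(n) for n in range(3, 2001))
print(smallest > 0, psi(3) > 0)
```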

For an arbitrary natural number $k$, the Boolean function that takes the value 1
if and only if at least $k$ of its arguments are set to 1 is called a threshold
function with the threshold $k$. Denote by $\mathrm{Thr}_{n,k}$ the threshold
function of $n$ variables with the threshold $k$.



Lemma 3.4. For an arbitrary natural number $n$, the relation

$$h(\mathrm{Thr}_{n,\lceil n/2 \rceil}) > n + 1 - \sqrt{n + 1}$$

holds, and for an arbitrary odd $n \ge 3$, the relation

$$h(\mathrm{Thr}_{n,(n+1)/2}) > n + \frac{3}{2} - \sqrt{n + 2}$$

holds.



Proof. Denote $m = \lceil n/2 \rceil$. Let $\Gamma$ be an optimal decision tree
implementing the function $\mathrm{Thr}_{n,m}$. Let us transform $\Gamma$ as
follows. We will process nonterminal nodes layer by layer starting from the
root. Let $v$ be the current node, $r_v$ the distance from the root to $v$, and
$\Gamma_v$ the subtree whose root is $v$. If $v$ is assigned the attribute
$x_{r_v+1}$, then skip this node and proceed to the next node. Let $v$ be
assigned an attribute $x_s$ that differs from $x_{r_v+1}$. Lemma 2.3 implies
that $\Gamma$ is a reduced tree. Then for each path in $\Gamma$, the attributes
assigned to the nonterminal nodes of this path are pairwise different.
Therefore, no node in $\Gamma_v$ except the root is assigned the attribute
$x_s$. Assign the attribute $x_{r_v+1}$ to the node $v$, assign the attribute
$x_s$ to all nonterminal nodes in $\Gamma_v$ which were assigned the attribute
$x_{r_v+1}$, and proceed to the next node.

One can see that $\Gamma_v$ is a decision tree implementing the function
$\mathrm{Thr}_{n,m}(\delta_1, \ldots, \delta_{r_v}, x_{r_v+1}, \ldots, x_n)$
for some $\delta_1, \ldots, \delta_{r_v} \in \{0, 1\}$. Since this function is
symmetrical, the transformation keeps the function implemented by $\Gamma_v$
and does not change the average depth of the tree.

Denote by $\Gamma$ the resulting tree. From the description of the
transformation it follows that $\Gamma$ is an optimal decision tree
implementing the function $\mathrm{Thr}_{n,m}$, and for $i = 1, \ldots, n$, all
nodes in the $i$-th layer are assigned the attribute $x_i$. According to
Lemma 2.3, $\Gamma$ is a reduced decision tree. One can see that for a tuple
$\bar\delta = (\delta_1, \ldots, \delta_n) \in E_2^n$, the length of the path
on which computations for $\bar\delta$ are performed is equal to $i$ if and
only if one of the following conditions holds:

• $\delta_i = 1$, and exactly $(m - 1)$ elements of the tuple
$(\delta_1, \ldots, \delta_{i-1})$ are equal to one;

• $\delta_i = 0$, and exactly $(n - m)$ elements of the tuple
$(\delta_1, \ldots, \delta_{i-1})$ are equal to zero.

In other words, the length of the path is the minimum of the position of the
$m$-th one and the position of the $(n - m + 1)$-th zero in the tuple
$\bar\delta$.

For $i = 1, \ldots, n$, there are $2^{n-i}(C_{i-1}^{m-1} + C_{i-1}^{n-m})$
tuples corresponding to the paths in $\Gamma$ of the length $i$. Then the
average depth of the decision tree $\Gamma$ is equal to

$$h(\Gamma) = 2^{-n} \sum_{i=1}^{n} i\, 2^{n-i} \left(C_{i-1}^{m-1} + C_{i-1}^{n-m}\right).$$

Applying simple transformations and using the previously introduced notation
$g(n, t)$, we obtain
$h(\Gamma) = h(\mathrm{Thr}_{n,m}) = m\, g(n, m) + (n - m + 1)\, g(n, n - m + 1)$.
Applying Lemma 3.2, we obtain

$$h(\mathrm{Thr}_{n,m}) = 2(n + 1) - \frac{1}{2^n}\left(m \sum_{i=0}^{m} C_{n+1}^i + (n - m + 1) \sum_{i=0}^{n-m+1} C_{n+1}^i\right).$$

If $n$ is odd, then $m = (n + 1)/2$. Taking into account that
$(n + 1)/2 = n - (n + 1)/2 + 1$, and

$$\sum_{i=0}^{(n+1)/2} C_{n+1}^i = 2^n + \frac{1}{2} C_{n+1}^{(n+1)/2}$$

as the number of binary tuples of the length $(n + 1)$ containing at most
$(n + 1)/2$ ones, we have

$$h(\mathrm{Thr}_{n,m}) = (n + 1)\left(1 - \frac{1}{2^{n+1}} C_{n+1}^{(n+1)/2}\right).$$

Using a known bound from [55] (see Chap. 8, Exercise 8.5.2)

$$C_{2n}^{n} < \frac{4^n}{\sqrt{3n + 1}}, \qquad (3.3)$$

we obtain that
$h(\mathrm{Thr}_{n,m}) > (n + 1)(1 - \sqrt{2}/\sqrt{3n + 5}) > (n + 1) - \sqrt{n + 1}$.
Applying Lemma 3.3, we obtain the bound
$h(\mathrm{Thr}_{n,m}) > n + 3/2 - \sqrt{n + 2}$ for any odd $n \ge 3$.

Let $n$ be even. Then $m = n/2$. Taking into account that
$\sum_{i=0}^{n/2} C_{n+1}^i = 2^n$ and
$\sum_{i=0}^{n/2+1} C_{n+1}^i = 2^n + C_{n+1}^{n/2+1}$, we have

$$h(\mathrm{Thr}_{n,m}) = (n + 1) - \frac{1}{2^n}\left(\frac{n}{2} + 1\right) C_{n+1}^{n/2+1} = (n + 1)\left(1 - \frac{1}{2^n} C_n^{n/2}\right).$$

The inequality (3.3) implies
$h(\mathrm{Thr}_{n,m}) > (n + 1)(1 - \sqrt{2}/\sqrt{3n + 2}) > (n + 1) - \sqrt{n + 1}$. □
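
The formulas from the proof of Lemma 3.4 can be evaluated exactly for moderate $n$. The Python sketch below is an illustration, assuming the layer-by-layer expression for $h(\mathrm{Thr}_{n,m})$ with $m = \lceil n/2 \rceil$ derived above; the names `h_threshold` and `closed_form` are hypothetical.

```python
from fractions import Fraction
from math import comb, sqrt


def h_threshold(n):
    """Sum over i of i * 2^(-i) * (C(i-1, m-1) + C(i-1, n-m)), m = ceil(n/2)."""
    m = (n + 1) // 2
    return sum(Fraction(i * (comb(i - 1, m - 1) + comb(i - 1, n - m)), 2 ** i)
               for i in range(1, n + 1))


def closed_form(n):
    """Expression obtained from Lemma 3.2 with partial binomial sums."""
    m = (n + 1) // 2
    s1 = sum(comb(n + 1, i) for i in range(m + 1))
    s2 = sum(comb(n + 1, i) for i in range(n - m + 2))
    return 2 * (n + 1) - Fraction(m * s1 + (n - m + 1) * s2, 2 ** n)


for n in (1, 2, 3, 4, 5):
    print(n, h_threshold(n), float(n + 1 - sqrt(n + 1)))
```

For $n = 3$ the value is $5/2$, the average depth of the natural tree for the majority of three variables (check two variables, and the third one only if they differ).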

Let $z = (\nu, f_1, \ldots, f_n)$ be a problem over a 2-valued information
system. A set of terminal separable subtables $\{I_1, \ldots, I_k\}$ of the
table $T_z$ is called compatible if for some natural number $l \le n$, there
exist numbers $a_1, \ldots, a_l \in \{1, \ldots, n\}$, and for
$j = 1, \ldots, k$, there exist tuples
$\bar\delta^j = (\delta_1^j, \ldots, \delta_l^j) \in E_2^l$,
$\bar\delta^i \ne \bar\delta^j$ for $i \ne j$, such that
$I_j = T_z(f_{a_1}, \delta_1^j) \ldots (f_{a_l}, \delta_l^j)$. We will say that
the terminal separable subtables $I_1, \ldots, I_k$ form a partition of the
table $T_z$ if $\bigcup_{i=1}^{k} I_i = T_z$, and $I_i \cap I_j = \emptyset$
for $i \ne j$.

Lemma 3.5. Let $z = (\nu, f_1, \ldots, f_n)$ be a problem over a 2-valued
information system such that $T_z = E_2^n$, and $P$ be a probability
distribution for the problem $z$. Then the following statements are valid:

a) for an arbitrary compatible set of terminal separable subtables
$\{I_1, \ldots, I_k\}$ of the table $T_z$, the inequality

$$h(z, P) \le n - \frac{1}{N(T_z, P)} \sum_{i=1}^{k} \log_2 D(I_i)\, N(I_i, P) \qquad (3.4)$$

holds;

b) there exists a partition of the table $T_z$ for which the relation (3.4)
holds as equality.

Proof. Let us prove part (a) of the lemma. Let $\nu(x) = \mathrm{const} = \nu_i$
on the set of rows of the table $I_i$ for $i = 1, \ldots, k$. Without loss of
generality, assume that
$I_i = T_z(f_1, \delta_1^i) \ldots (f_l, \delta_l^i)$ for some
$\delta_1^i, \ldots, \delta_l^i \in E_2$, $i = 1, \ldots, k$. Let us build a
decision tree $\Gamma$ for the problem $z$ in the following way.

Step 0. Build a complete binary tree with $(l + 1)$ layers of nodes. For
$i = 1, \ldots, l$, assign the attribute $f_i$ to each node in the $i$-th layer
and proceed to the first step.

Let $t \ge 0$ steps have already been performed.

Step $(t + 1)$. If each terminal node in the tree $\Gamma$ has already been
labeled with a number, the algorithm finishes. Otherwise, choose in $\Gamma$ an
unlabeled terminal node $v$. Denote by $\xi$ the path from the root to the node
$v$. If $T_z\pi(\xi) = I_i$ for some $i \in \{1, \ldots, k\}$, then label the
node $v$ with the number $\nu_i$ and proceed to the next step. Otherwise,
replace the node $v$ with a complete binary tree $\Gamma_v$ with $(n - l + 1)$
layers of nodes, and for $i = 1, \ldots, n - l$, assign the attribute $f_{l+i}$
to all nonterminal nodes in the $i$-th layer of the tree $\Gamma_v$. Then
assign to each terminal node $w$ of the tree $\Gamma_v$ the natural number
$a_w$ defined as follows. Denote by $\varphi$ the path from the root of the
tree $\Gamma$ to the node $w$. Since each of the attributes
$f_1, \ldots, f_n$ is assigned to a node in the path $\varphi$ and
$T_z = E_2^n$, the subtable $T_z\pi(\varphi)$ consists of a single row. Denote
that row by $\bar\delta$ and set $a_w = \nu(\bar\delta)$. Proceed to the step
$(t + 2)$.

One can see that the algorithm finishes after the $(2^l + 1)$-th step and the
resulting decision tree solves the problem $z$. For an arbitrary tuple
$\bar\delta \in T_z$, denote by $\xi(\bar\delta)$ the complete path on which
computations for $\bar\delta$ are performed. From the description of the tree
building procedure, the length of $\xi(\bar\delta)$ is equal to $l$ if
$\bar\delta \in I_i$ for some $i \in \{1, \ldots, k\}$, and is equal to $n$
otherwise. Denote $I = I_1 \cup \ldots \cup I_k$. Then the expression for the
average depth of the decision tree $\Gamma$ has the following form:

$$h(\Gamma, P) = \frac{1}{N(T_z, P)}\left(\sum_{\bar\delta \in T_z \setminus I} n\, P(\bar\delta) + \sum_{\bar\delta \in I} l\, P(\bar\delta)\right) = n - \frac{1}{N(T_z, P)} \sum_{i=1}^{k} (n - l)\, N(I_i, P). \qquad (3.5)$$

Since $T_z = E_2^n$, each subtable $I_i$ contains exactly $2^{n-l}$ rows, and
$\log_2 D(I_i) = n - l$. Taking into account the obvious inequality
$h(z, P) \le h(\Gamma, P)$, the equality (3.5) can be easily transformed into
(3.4).

Let us prove part (b) of the lemma. Let $\Gamma$ be an optimal decision tree
for the problem $z$. Denote by $W(\Gamma)$ the set of terminal nodes in the
decision tree $\Gamma$. For an arbitrary terminal node $w \in W(\Gamma)$,
denote by $path(w)$ the path from the root to the node $w$. Since the tree
$\Gamma$ solves $z$, $T_z\pi(path(w))$ is a terminal subtable,
$T_z\pi(path(w_1)) \cap T_z\pi(path(w_2)) = \emptyset$ for any two different
nodes $w_1$ and $w_2$, and
$\bigcup_{w \in W(\Gamma)} T_z\pi(path(w)) = T_z$. For an arbitrary terminal
node $w \in W(\Gamma)$, denote $I_w = T_z\pi(path(w))$ and choose the set
$\{I_w : w \in W(\Gamma)\}$ as the desired partition. From Lemma 2.3 it follows
that $\Gamma$ is a reduced tree. Then for an arbitrary terminal node
$w \in W(\Gamma)$, the nonterminal nodes of the path $path(w)$ are assigned
pairwise different attributes. From the condition $T_z = E_2^n$ it follows that
the number of rows in the subtable $I_w$ is equal to
$2^{n - l_\Gamma(path(w))}$. Therefore, the length of the path on which
computations for all rows of the table $I_w$ are performed is equal to
$(n - \log_2 D(I_w))$. Finally, we transform the expression for the average
depth of the decision tree $\Gamma$:

$$h(z, P) = \frac{1}{N(T_z, P)} \sum_{w \in W(\Gamma)} l_\Gamma(path(w))\, N(I_w, P)
= \frac{1}{N(T_z, P)} \sum_{w \in W(\Gamma)} (n - \log_2 D(I_w))\, N(I_w, P)
= n - \frac{1}{N(T_z, P)} \sum_{w \in W(\Gamma)} \log_2 D(I_w)\, N(I_w, P).$$

□

Lemma 3.6. For an arbitrary natural number $n$, the minimum average depth
of decision trees implementing the Boolean functions
$x_1 \oplus \ldots \oplus x_n$ and $x_1 \oplus \ldots \oplus x_n \oplus 1$ is
equal to $n$.

Proof. Let $z = (f, x_1, \ldots, x_n)$ be a problem corresponding to the
function $f(x_1, \ldots, x_n) = x_1 \oplus \ldots \oplus x_n$. Let us show that
$T_z$ does not have terminal separable subtables that contain more than one
row. Assume the contrary. Let there exist a word $\alpha \in \Omega_z$ such
that $T_z\alpha$ is a terminal separable subtable containing at least two rows
$\bar\delta = (\delta_1, \ldots, \delta_n)$ and
$\bar\sigma = (\sigma_1, \ldots, \sigma_n)$, $\bar\delta \ne \bar\sigma$. Let
$\delta_i \ne \sigma_i$ for some $i \in \{1, \ldots, n\}$. Then the word
$\alpha$ does not contain the letters $(x_i, 0)$ and $(x_i, 1)$. Since
$T_z = E_2^n$, the subtable $T_z\alpha$ also contains the row
$\bar\delta^* = (\delta_1, \ldots, \delta_{i-1}, \neg\delta_i, \delta_{i+1}, \ldots, \delta_n)$.
Since $f(\bar\delta) \ne f(\bar\delta^*)$, we obtain a contradiction with the
assumption that $T_z\alpha$ is a terminal subtable. Therefore, all terminal
separable subtables of $T_z$ consist of a single row. From part (b) of
Lemma 3.5 it follows that $h(f) = n$. The lemma is proved analogously for the
function $x_1 \oplus \ldots \oplus x_n \oplus 1$. □
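
The key step of the proof, that the parity function is not constant on any subtable with more than one row obtained by fixing some attributes, can be checked by enumeration for small $n$. In the Python sketch below, a separable subtable of $E_2^n$ is modelled as a subcube given by a pattern over `0`, `1` and `None` (free position); this encoding and the name `constant_subcubes` are assumptions made for illustration.

```python
from itertools import product


def constant_subcubes(f, n):
    """Patterns of subcubes of E_2^n with at least two rows on which f is constant."""
    found = []
    for pattern in product((0, 1, None), repeat=n):
        free = [i for i, p in enumerate(pattern) if p is None]
        if not free:                      # a single row, not of interest here
            continue
        values = set()
        for fill in product((0, 1), repeat=len(free)):
            row = list(pattern)
            for i, b in zip(free, fill):
                row[i] = b
            values.add(f(row))
        if len(values) == 1:
            found.append(pattern)
    return found


def parity(row):
    return sum(row) % 2


print(constant_subcubes(parity, 4))
print(constant_subcubes(lambda row: row[0] & row[1], 2))
```

For the parity function the list is empty, while for the conjunction of two variables the two subcubes $x_1 = 0$ and $x_2 = 0$ are reported.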

Lemma 3.7. Let $f(x_1, \ldots, x_n)$ be an arbitrary non-constant Boolean
function. Denote by $\tilde{f}$ one of the functions
$f(x_1, \ldots, x_n) \wedge x_{n+1}$,
$f(x_1, \ldots, x_n) \wedge \neg x_{n+1}$,
$f(x_1, \ldots, x_n) \vee x_{n+1}$,
$f(x_1, \ldots, x_n) \vee \neg x_{n+1}$. Then the relation
$h(\tilde{f}) = 1 + h(f)/2$ holds.

Proof. Let us prove the lemma by induction on the number of essential
variables of the function $f$. Obviously, each function that has a single
essential variable can be represented (up to inessential variables) in the form
$f(x) = x$ or $f(x) = \neg x$, and the lemma is valid for these functions.

Let the lemma be valid for all functions with at most $(t - 1)$ essential
variables for some $t > 1$. Let $f$ be a function with $t$ essential variables.
Denote $\tilde{f} = f(x_1, \ldots, x_n) \wedge x_{n+1}$.

Let us build a decision tree $\tilde\Gamma$ in the following way. The root of
$\tilde\Gamma$ is assigned the attribute $x_{n+1}$. Two edges leave the root,
labeled with the numbers 0 and 1. The edge labeled with 0 enters a terminal
node which is labeled with the number 0. The edge labeled with 1 enters the
root of an optimal decision tree implementing the function $f$. The decision
tree $\tilde\Gamma$ does not contain any other nodes and edges. It is easy to
see that $\tilde\Gamma$ implements the function $\tilde{f}$. According to
Lemma 3.1,

$$h(\tilde\Gamma) = 1 + \frac{h(f)}{2}. \qquad (3.6)$$

To prove the lemma it is sufficient to show that $\tilde\Gamma$ is an optimal
decision tree. Assume the contrary. In this case, there exists an optimal
decision tree $\Gamma$ whose root is assigned an attribute other than
$x_{n+1}$. Assume without loss of generality that it is the attribute $x_1$.
For $\delta = 0, 1$, denote by $e_\delta$ the edge that leaves the root of
$\Gamma$ and is labeled with the number $\delta$, and denote by
$\Gamma_\delta$ the decision tree whose root the edge $e_\delta$ enters. One
can see that the decision tree $\Gamma_\delta$ implements the function
$f(\delta, x_2, \ldots, x_n) \wedge x_{n+1}$. Since $f$ has $t$ essential
variables, at least one of the functions $f(0, x_2, \ldots, x_n)$,
$f(1, x_2, \ldots, x_n)$ is a non-constant function. Let both functions be
non-constant. Then the inductive hypothesis implies that for $\delta = 0, 1$,
the relation

$$h(f(\delta, x_2, \ldots, x_n) \wedge x_{n+1}) = 1 + \frac{h(f(\delta, x_2, \ldots, x_n))}{2} \qquad (3.7)$$

holds. From Lemma 3.1 it follows that

$$h(\Gamma) = 1 + \frac{1}{2}\left(1 + \frac{h(f(0, x_2, \ldots, x_n))}{2} + 1 + \frac{h(f(1, x_2, \ldots, x_n))}{2}\right).$$

Then

$$h(\Gamma) = 2 + \frac{h(f(0, x_2, \ldots, x_n)) + h(f(1, x_2, \ldots, x_n))}{4}. \qquad (3.8)$$

Let us build a decision tree $G$ as follows. The root of $G$ is assigned the
attribute $x_1$. Two edges leave the root, labeled with the numbers 0 and 1.
For $\delta = 0, 1$, the edge labeled with the number $\delta$ enters the root
of an optimal decision tree for the function $f(\delta, x_2, \ldots, x_n)$. The
tree $G$ does not contain any other nodes and edges. One can see that $G$
implements the function $f$. According to Lemma 3.1,

$$h(G) = 1 + \frac{h(f(0, x_2, \ldots, x_n)) + h(f(1, x_2, \ldots, x_n))}{2}. \qquad (3.9)$$

Taking into account the inequality $h(f) \le h(G)$ and substituting (3.9)
into (3.8), we obtain that $h(\Gamma) \ge 3/2 + h(f)/2$. Comparing the last
relation to (3.6), we have $h(\Gamma) > h(\tilde\Gamma)$, which contradicts the
assumption that $\Gamma$ is an optimal decision tree. Therefore, only one of
the functions $f(0, x_2, \ldots, x_n)$, $f(1, x_2, \ldots, x_n)$ is
non-constant. Suppose for definiteness that
$f(1, x_2, \ldots, x_n) = \mathrm{const}$. Then $f$ can be represented in the
form $f = x_1 \vee f(0, x_2, \ldots, x_n)$ or
$f = \neg x_1 \wedge f(0, x_2, \ldots, x_n)$. The inductive hypothesis implies
that $h(f) = 1 + h(f(0, x_2, \ldots, x_n))/2$. The function
$f(0, x_2, \ldots, x_n) \wedge x_{n+1}$ is non-constant, and for $\delta = 0$,
the relation (3.7) holds. Then the relation
$h(f(0, x_2, \ldots, x_n) \wedge x_{n+1}) = h(f)$ holds and implies

$$h(\tilde{f}) = h(\Gamma) = 1 + \frac{h(f)}{2} + \frac{h(f(1, x_2, \ldots, x_n) \wedge x_{n+1})}{2}.$$

Comparing this relation with (3.6), we have $h(\tilde\Gamma) \le h(\Gamma)$,
which contradicts the assumption that the tree $\tilde\Gamma$ is not optimal.
Consequently, the tree $\tilde\Gamma$ is optimal and
$h(\tilde{f}) = 1 + h(f)/2$. The induction step is proved analogously for the
other types of the function $\tilde{f}$ listed in the lemma. □
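
The statement of Lemma 3.7 can be verified exhaustively for functions of a few variables. The following Python sketch is illustrative only: it computes the minimum average depth by brute-force recursion over truth tables (rows in lexicographic order, the new variable $x_{n+1}$ being the last coordinate). The names `h`, `extend` and `OPS` are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def h(table):
    """Minimum average depth of a decision tree for the given truth table."""
    if len(set(table)) == 1:
        return Fraction(0)
    n = len(table).bit_length() - 1
    rows = list(product((0, 1), repeat=n))
    best = None
    for i in range(n):
        halves = [tuple(v for r, v in zip(rows, table) if r[i] == b) for b in (0, 1)]
        cand = 1 + (h(halves[0]) + h(halves[1])) / 2
        best = cand if best is None else min(best, cand)
    return best


OPS = {
    "f AND x": lambda v, x: v & x,
    "f AND NOT x": lambda v, x: v & (1 - x),
    "f OR x": lambda v, x: v | x,
    "f OR NOT x": lambda v, x: v | (1 - x),
}


def extend(table, op):
    """Truth table of op(f, x_{n+1}); the new variable is the last coordinate."""
    return tuple(op(v, x) for v in table for x in (0, 1))


# All non-constant functions of three variables and all four extensions.
violations = [(table, name)
              for table in product((0, 1), repeat=8) if len(set(table)) > 1
              for name, op in OPS.items()
              if h(extend(table, op)) != 1 + h(table) / 2]
print(len(violations))
```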

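For arbitrary numbers $x, \sigma \in E_2$, denote

$$x^\sigma = \begin{cases} x, & \text{if } \sigma = 1, \\ \neg x, & \text{if } \sigma = 0. \end{cases}$$
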

Proof of Proposition 3.1. For an arbitrary natural number $n$ and a number
$\sigma \in E_2$, set into correspondence to the function
$f(x_1, \ldots, x_n) = \sigma$ a decision tree $\Gamma_0(\sigma)$ that consists
of a single terminal node labeled with $\sigma$. One can see that the tree
$\Gamma_0(\sigma)$ implements $f$ and $h(\Gamma_0(\sigma)) = 0$. Then
$\mathcal{H}_B(n) = 0$. □



Proof of Proposition 3.2. For an arbitrary natural number $n$, a natural
number $i \le n$, and a number $\sigma \in E_2$, set into correspondence to the
function $f(x_1, \ldots, x_n) = x_i^\sigma$ a decision tree
$\Gamma_1(i, \sigma)$. The decision tree $\Gamma_1(i, \sigma)$ consists of one
nonterminal node $v$ labeled with the attribute $x_i$ and two terminal nodes
$w_0$ and $w_1$ labeled with the numbers $0^\sigma$ and $1^\sigma$
respectively. For $\delta = 0, 1$, there is an edge leaving $v$ and entering
$w_\delta$, and this edge is labeled with the number $\delta$. The tree
$\Gamma_1(i, \sigma)$ does not contain other nodes and edges. One can see that
the tree $\Gamma_1(i, \sigma)$ implements $f$ and
$h(\Gamma_1(i, \sigma)) = 1$. On the other hand, a decision tree that
implements a non-constant Boolean function must have at least two terminal
nodes and, consequently, at least one nonterminal node. Therefore,
$\Gamma_1(i, \sigma)$ is an optimal decision tree and
$\mathcal{H}_B(n) = 1$. □



Proof of Proposition 3.3. Any function of $n$ arguments from the set
$S_1 \cup S_3 \cup S_5 \cup S_6 \cup P_1 \cup P_3 \cup P_5 \cup P_6$ can be
represented, up to argument names, in the form
$f_0(x_1, \ldots, x_n) = 0$, $f_1(x_1, \ldots, x_n) = 1$,
$f_t^\vee(x_1, \ldots, x_n) = x_1 \vee \ldots \vee x_t$ or
$f_t^\wedge(x_1, \ldots, x_n) = x_1 \wedge \ldots \wedge x_t$ where $t \le n$.
Let us prove by induction on $t$ that the relation
$h(f_t^\vee) = h(f_t^\wedge) = 2 - 1/2^{t-1}$ holds for $t = 1, \ldots, n$. If
$t = 1$, then $f_1^\vee = f_1^\wedge = x_1$ and
$h(f_1^\vee) = h(f_1^\wedge) = 1$. Let the relation be valid for each $i$ less
than $t$. From Lemma 3.7 it follows that
$h(f_t^\vee) = 1 + h(f_{t-1}^\vee)/2$ and
$h(f_t^\wedge) = 1 + h(f_{t-1}^\wedge)/2$. According to the inductive
hypothesis, $h(f_{t-1}^\vee) = h(f_{t-1}^\wedge) = 2 - 1/2^{t-2}$. Then
$h(f_t^\vee) = h(f_t^\wedge) = 1 + (2 - 1/2^{t-2})/2 = 2 - 1/2^{t-1}$. One can
see that
$\max_{t \in \{1, \ldots, n\}} h(f_t^\vee) = \max_{t \in \{1, \ldots, n\}} h(f_t^\wedge) = 2 - 2^{1-n}$
and the maximum is reached on the functions $f_n^\vee$ and $f_n^\wedge$. Then
the validity of the proposition follows from the fact that for $n \ge 1$, each
of the classes $S_1, S_3, S_5, S_6, P_1, P_3, P_5, P_6$ contains at least one
of the functions $f_n^\vee$ and $f_n^\wedge$. □

Proof of Proposition 3.4. Each of the closed classes listed in the proposition
contains at least one of the functions $x_1 \oplus \ldots \oplus x_n$ and
$x_1 \oplus \ldots \oplus x_n \oplus 1$. Using Lemma 3.6, we obtain that
$\mathcal{H}_B(n) \ge n$. On the other hand, obviously, any Boolean function of
$n$ arguments can be implemented by a decision tree of the average depth at
most $n$. Consequently, $\mathcal{H}_B(n) = n$. □



Proof of Proposition 3.5. For $n = 2k + 1$, both classes contain the function
$x_1 \oplus \ldots \oplus x_n$, and the validity of the proposition is proved
analogously to Proposition 3.4. For $n = 2k$, none of the classes contains a
function with $n$ essential variables and both classes contain the function
$x_1 \oplus \ldots \oplus x_{n-1}$. Therefore,
$\mathcal{H}_B(n) = n - 1$. □



Proof of Proposition 3.6. For $n = 2k + 1$, the class contains the function
$x_1 \oplus \ldots \oplus x_n$, and the validity of the proposition is proved
analogously to Proposition 3.4. Let $n = 2k$, and $f$ be the function such that
$h(f) = \mathcal{H}_B(n)$. Let $z = (f, x_1, \ldots, x_n)$ be a problem
corresponding to the function $f$. Consider a sequence of rows
$\bar\delta^0, \ldots, \bar\delta^n \in T_z$ where
$\bar\delta^0 = (0, \ldots, 0, 0)$, $\bar\delta^1 = (0, \ldots, 0, 1)$,
$\bar\delta^2 = (0, \ldots, 1, 1)$, ..., $\bar\delta^n = (1, \ldots, 1, 1)$.
Note that $f(\bar\delta^0) = 0$ and $f(\bar\delta^n) = 1$ because
$f \in C_4$. Since $n = 2k$, the relation
$f(\bar\delta^i) = f(\bar\delta^{i+1})$ holds for some
$i \in \{0, \ldots, n - 1\}$. Thus the table $T_z$ has a terminal separable
subtable
$T_z(x_1, 0) \ldots (x_{n-i-1}, 0)(x_{n-i+1}, 1) \ldots (x_n, 1)$ containing
exactly two rows: $\bar\delta^i$ and $\bar\delta^{i+1}$. From part (a) of
Lemma 3.5 it follows that $h(f) \le n - 2^{1-n}$. Then
$\mathcal{H}_B(n) \le n - 2^{1-n}$.

Consider the function
$f(x_1, \ldots, x_n) = x_1 \wedge \ldots \wedge x_n \vee (x_1 \oplus \ldots \oplus x_n)$.
Let $z$ be the problem corresponding to the function $f$. One can see that the
table $T_z$ has $n$ terminal separable subtables $I_1, \ldots, I_n$,
$I_i = T_z(x_1, 1) \ldots (x_{i-1}, 1)(x_{i+1}, 1) \ldots (x_n, 1)$, containing
two rows and does not have other terminal separable subtables containing more
than one row. The subtables $I_1, \ldots, I_n$ have the common row
$(1, \ldots, 1)$. Thus any partition of the table $T_z$ can contain only one of
the subtables $I_1, \ldots, I_n$. From part (b) of Lemma 3.5 it follows that
$h(f) = n - 2^{1-n}$. Then $\mathcal{H}_B(n) = n - 2^{1-n}$. □
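
For small even $n$, the value $n - 2^{1-n}$ for the function $x_1 \wedge \ldots \wedge x_n \vee (x_1 \oplus \ldots \oplus x_n)$ can be confirmed by direct computation. The Python sketch below is an illustration that assumes a brute-force recursion over truth tables; the names `table_of` and `min_avg_depth` are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product


def table_of(n):
    """Truth table of (x1 AND ... AND xn) OR (x1 XOR ... XOR xn)."""
    return tuple(int(all(r)) | (sum(r) % 2) for r in product((0, 1), repeat=n))


@lru_cache(maxsize=None)
def min_avg_depth(table):
    """Minimum average depth under the uniform distribution."""
    if len(set(table)) == 1:
        return Fraction(0)
    n = len(table).bit_length() - 1
    rows = list(product((0, 1), repeat=n))
    best = None
    for i in range(n):
        halves = [tuple(v for r, v in zip(rows, table) if r[i] == b) for b in (0, 1)]
        cand = 1 + (min_avg_depth(halves[0]) + min_avg_depth(halves[1])) / 2
        best = cand if best is None else min(best, cand)
    return best


for n in (2, 4, 6):
    print(n, min_avg_depth(table_of(n)), n - Fraction(2, 2 ** n))
```

For $n = 4$ the optimum corresponds to checking $x_1, x_2, x_3$ and skipping $x_4$ only when all three are equal to 1, which saves one step on two of the sixteen rows.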



Proof of Proposition 3.7. For $n = 2k + 1$, both classes contain one of the
functions $x_1 \oplus \ldots \oplus x_n$, $x_1 \oplus \ldots \oplus x_n \oplus 1$ and validity of the proposition
is proved analogously to Proposition 13.41 Let n = 2k. Since Di C C4 and 
-D3 C d, validity of the upper bound follows from Proposition [3^61 

Consider a function $f(x_1, \ldots, x_n)$ defined as follows: on a tuple
$\bar\delta = (\delta_1, \ldots, \delta_n)$ it takes the value
$(\delta_1 \oplus \ldots \oplus \delta_n)$ if the number of zeros in
$\bar\delta$ is greater than the number of ones, the value
$(\delta_1 \oplus \ldots \oplus \delta_n \oplus 1)$ if the number of zeros in
$\bar\delta$ is less than the number of ones, and the value $\delta_1$ if the
number of zeros is equal to the number of ones. Let us show that any terminal
separable subtable $I$ of the table $T_z$ contains at most one row in which the
number of zeros differs from the number of ones. Assume the contrary, i.e. the
table $I$ contains at least two such rows. Let the number of zeros in the first
row exceed the number of ones. If the second row satisfies the same condition,
then the subtable $I$ contains two rows $\bar\delta^1$ and $\bar\delta^2$ in
which the number of zeros is greater than the number of ones and which differ
in exactly one digit. If the number of ones in the second row exceeds the
number of zeros, then the subtable $I$ contains two rows $\bar\delta^1$ and
$\bar\delta^2$ with $(n/2 - 1)$ and $(n/2 + 1)$ ones respectively. According to
the definition of the function, $f(\bar\delta^1) \ne f(\bar\delta^2)$ in both
cases, which contradicts the assumption that $I$ is a terminal subtable. The
case when the first row contains more ones than zeros is considered
analogously. Therefore, each terminal separable subtable of the table $T_z$
contains at most one row in which the number of zeros differs from the number
of ones. It implies that any terminal separable subtable of $T_z$ contains at
most two rows and each two-row subtable contains a row with $n/2$ zeros and
$n/2$ ones.

Obviously, the table $T_z$ contains $C_n^{n/2}$ such rows. Then any partition
of the table $T_z$ contains at most $C_n^{n/2}$ subtables with two rows and
does not contain subtables with a greater number of rows. According to part (b)
of Lemma 3.5, the relation $h(f) \ge n - \frac{1}{2^{n-1}} C_n^{n/2}$ holds.
Using the bound (3.3), we obtain

$$\mathcal{H}_B(n) \ge h(f) \ge n - \frac{2\sqrt{2}}{\sqrt{3n + 2}} > n - \frac{1.7}{\sqrt{n}}.$$

□



Proof of Proposition 3.8. It is not hard to show that each class contains the
function $\mathrm{Thr}_{n,\lceil n/2 \rceil}$. Validity of the lower bound of
the proposition immediately follows from Lemma 3.4. Let us prove validity of
the upper bound. Let $f(x_1, \ldots, x_n)$ be a function for which the equality
$h(f) = \mathcal{H}_B(n)$ holds. Denote by $z = (f, x_1, \ldots, x_n)$ the
problem corresponding to the function $f$. Consider the value of the function
$f(\bar\delta)$ on the tuple $\bar\delta = (\delta_1, \ldots, \delta_n)$ where
$\delta_1 = \delta_2 = \ldots = \delta_m = 1$,
$\delta_{m+1} = \delta_{m+2} = \ldots = \delta_n = 0$,
$m = \lfloor n/2 \rfloor$. If $f(\bar\delta) = 1$, then $f$ takes the value 1
on all tuples in which the first $m$ digits are set to 1. Then the table $T_z$
has a terminal separable subtable $T_z(x_1, 1)(x_2, 1) \ldots (x_m, 1)$
containing $2^{n-m} = 2^{\lfloor (n+1)/2 \rfloor}$ rows. From part (a) of
Lemma 3.5 it follows that

$$h(f) \le n - \lfloor (n + 1)/2 \rfloor\, 2^{-\lfloor (n+1)/2 \rfloor}. \qquad (3.10)$$

If $f(\bar\delta) = 0$, then the function $f$ takes the value 0 on all tuples
in which the last $(n - m)$ digits are set to 0. Thus the table $T_z$ has a
terminal separable subtable
$T_z(x_{m+1}, 0)(x_{m+2}, 0) \ldots (x_n, 0)$ containing
$2^m = 2^{\lfloor n/2 \rfloor}$ rows. From part (a) of Lemma 3.5 it follows
that

$$h(f) \le n - \lfloor n/2 \rfloor\, 2^{-\lfloor n/2 \rfloor}. \qquad (3.11)$$

By taking the weakest bound of (3.10) and (3.11), we obtain
$\mathcal{H}_B(n) = h(f) \le n - \lfloor n/2 \rfloor\, 2^{-\lfloor n/2 \rfloor}$. □



Proof of Proposition 3.9. It is easy to show that the class $D_2$ contains the
function $\mathrm{Thr}_{n,(n+1)/2}$ for an arbitrary odd $n$, and the function
$\mathrm{Thr}_{n-1,n/2}$ for an arbitrary even $n$. Then validity of the lower
bound for an arbitrary $n \ge 3$ follows from Lemma 3.4. Validity of the lower
bound for $n = 1, 2$ can be proved by a direct check. The upper bound follows
from the relation $D_2 \subset M_1$ and Proposition 3.8. □



Proof of Proposition 3.10. One can see that each function of $n$ arguments
from the set
$F_1^\infty \cup F_4^\infty \cup F_5^\infty \cup F_8^\infty$ can be represented
in the form
$f^0(x_1, \ldots, x_n) = x_i \vee \varphi_{n-1}^0(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$
or
$f^1(x_1, \ldots, x_n) = x_i \wedge \varphi_{n-1}^1(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$.
If $\varphi_{n-1}^\delta$ is a constant function for $\delta = 0$ or
$\delta = 1$, then $h(f^\delta) \le 1$. Let the function
$\varphi_{n-1}^\delta$ be non-constant. According to Lemma 3.7,
$h(f^\delta) = 1 + h(\varphi_{n-1}^\delta)/2$. Since the function
$\varphi_{n-1}^\delta$ has at most $(n - 1)$ essential variables,
$h(\varphi_{n-1}^\delta) \le n - 1$ and $h(f^\delta) \le 1 + (n - 1)/2$. This
relation holds for all functions and, consequently,
$\mathcal{H}_B(n) \le (n + 1)/2$.

For $\sigma = 0, 1$, consider the functions
$f^{0,\sigma}(x_1, \ldots, x_n) = x_1 \vee (x_2 \oplus \ldots \oplus x_n \oplus \sigma)$
and
$f^{1,\sigma}(x_1, \ldots, x_n) = x_1 \wedge (x_2 \oplus \ldots \oplus x_n \oplus \sigma)$.
One can see that for an arbitrary natural $n$, each of the classes
$F_1^\infty, F_4^\infty, F_5^\infty, F_8^\infty$ contains at least one of these
functions. According to Lemma 3.7,
$h(f^{0,\sigma}) = h(f^{1,\sigma}) = 1 + h(x_2 \oplus \ldots \oplus x_n \oplus \sigma)/2 = (n + 1)/2$.
Then $\mathcal{H}_B(n) = (n + 1)/2$. □

Proof of Proposition 3.11. One can see that each class contains one of the
functions

$$f_1 = \mathrm{Thr}_{n-1,\lceil (n-1)/2 \rceil}(x_1, \ldots, x_{n-1}) \wedge x_n,$$
$$f_2 = \mathrm{Thr}_{n-1,\lceil (n-1)/2 \rceil}(x_1, \ldots, x_{n-1}) \vee x_n$$

for every $n > 1$. According to Lemma 3.4,
$h(\mathrm{Thr}_{n-1,\lceil (n-1)/2 \rceil}) \ge n - \sqrt{n}$. According to
Lemma 3.7, $h(f_1) = h(f_2) \ge 1 + (n - \sqrt{n})/2$. Validity of the lower
bound for $n = 1$ can be proved by a direct check. Validity of the upper bound
follows from the relations $F_2^\infty \subset F_1^\infty$,
$F_3^\infty \subset F_4^\infty$, $F_6^\infty \subset F_5^\infty$,
$F_7^\infty \subset F_8^\infty$ and Proposition 3.10. □

Proof of Proposition 3.12. Validity of the lower bound follows from the
relations $F_1^\mu \supset F_1^\infty$, $F_4^\mu \supset F_4^\infty$,
$F_5^\mu \supset F_5^\infty$, $F_8^\mu \supset F_8^\infty$, $\mu \ge 2$ and
Proposition 3.10.

Let $B = F_4^2$, and $f(x_1, \ldots, x_n)$ be a function such that
$h(f) = \mathcal{H}_B(n)$. Let $z = (f, x_1, \ldots, x_n)$ be the problem
corresponding to $f$. Since $F_4^2 \subset C_2$, the relation
$f(1, \ldots, 1) = 1$ holds. The lower bound implies that $f$ differs from the
constant 1. Then there exists a number $k$, $1 \le k \le n$, such that $f$
takes the value 1 on all tuples containing less than $k$ zeros, and there
exists a tuple containing exactly $k$ zeros on which the function takes the
value 0. Then the table $T_z$ has a terminal separable subtable
$T_z(x_1, 1)(x_2, 1) \ldots (x_{n-k+1}, 1)$ that contains $2^{k-1}$ rows.
According to part (a) of Lemma 3.5, the inequality

h{f)<n^'^ (3.12) 

holds. Without loss of generality, assume that the function takes the value 0
on the tuple $\bar\delta = (\delta_1, \ldots, \delta_n)$ in which
$\delta_1 = \delta_2 = \ldots = \delta_k = 0$ and
$\delta_{k+1} = \delta_{k+2} = \ldots = \delta_n = 1$. Then in each tuple
$\bar\delta = (\delta_1, \ldots, \delta_n)$ such that $f(\bar\delta) = 0$, at
least one of the first $k$ digits $\delta_1, \ldots, \delta_k$ is set to zero.
Therefore, the table $T_z$ has a terminal separable subtable
$T_z(x_1, 1)(x_2, 1) \ldots (x_k, 1)$ containing $2^{n-k}$ rows. According to
part (a) of Lemma 3.5, the inequality

hU)<n-'^ (3.13) 

holds. 

The weakest of the bounds p.l2p and ()3.13p reaches the maximum on 
k = l{n + l)/2\. Consequently, Hs(n) < n- [n/2j2-L«/2j xhc upper bound 
for B — F^ is proved analogously. The upper bound for the remaining classes 
follows from the relations F^ C F^, F^ C F^, F^ C F^, F^ C F| that are 
valid for any natural /i > 2. D 



Proof of Proposition \3.1SX Validity of the lower bound follows from the 

relations F^ D F!^ , F^ D F^ , F^ D F^ , F^ ^ F^ , fi > 2 and Proposition 



ins 

Validity of the upper bound follows from the relations i^f C F^ , F/^ C F^ , 
K ^ P5 P7 ^ -P's'' /^ > 2 and Proposition [HI D 

3.2 On Branching Programs with Minimum Average 
Depth 

This section considers the possibility of joint optimization of time and space
complexity. For this purpose, a decision tree is represented in a compact form
called a branching program. According to Theorem 3.4, the requirement that a
branching program have the minimum average weighted depth is rather
strong, since all branching programs with the minimum average weighted
depth are read-once. This fact reveals a contradiction between time and space
complexity requirements, because many problems have high lower bounds on
the number of nodes in a read-once branching program. The section concludes
with a description of several problems for which the number of nodes
in a branching program with the minimum average weighted depth grows
exponentially with the number of attributes.

Let $U = (A, F)$ be a 2-valued information system, and $\Psi$ a weight function
for $U$. A branching program for the problem $z = (\nu, f_1, \ldots, f_n)$ over $U$ is a
finite oriented acyclic graph in which:

a) at least one edge enters each node except one, called the root of the
branching program;

b) each terminal node (a node that does not have outgoing edges) is labeled
with a number from $\omega$;

c) two edges leave each nonterminal node, labeled with the numbers 0 and
1 respectively;

d) each nonterminal node is assigned an attribute from the set
$\{f_1, \ldots, f_n\}$.

A path from the root to a terminal node is called complete. A branching
program is called read-once if in each complete path, all nonterminal nodes
are assigned pairwise different attributes.
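The definition above can be illustrated with a minimal Python sketch. The class names `Terminal` and `Inner`, the function `is_read_once`, and the 1-based attribute indices are assumptions made for this illustration only; they do not come from the text. The check walks every complete path, so it is meant for small examples rather than as an efficient procedure.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Terminal:
    label: int          # a number labeling a terminal node


@dataclass(frozen=True)
class Inner:
    attr: int           # index i of the attribute f_i assigned to the node
    low: "Node"         # node entered by the edge labeled 0
    high: "Node"        # node entered by the edge labeled 1


Node = Union[Terminal, Inner]


def is_read_once(node, seen=frozenset()):
    """Return True if no complete path from `node` repeats an attribute."""
    if isinstance(node, Terminal):
        return True
    if node.attr in seen:
        return False    # the attribute already occurred earlier on this path
    seen = seen | {node.attr}
    return is_read_once(node.low, seen) and is_read_once(node.high, seen)


# x1 XOR x2 with shared terminal nodes: every path tests x1 and x2 once.
t0, t1 = Terminal(0), Terminal(1)
xor_program = Inner(1, Inner(2, t0, t1), Inner(2, t1, t0))

# A program that tests x1 twice along the path following the edge labeled 0.
repeated = Inner(1, Inner(1, t0, t1), t1)
```

Here `is_read_once(xor_program)` holds, while `is_read_once(repeated)` does not.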

Let $G$ be a branching program for the problem $z$. For an arbitrary complete
path $\xi$ in $G$, let us define the subtable $T_z\pi(\xi)$ of the table $T_z$ and the
path weight $\Psi(\xi)$ in the same way as for decision trees. For
an arbitrary row $\bar\delta \in T_z$, denote by $\xi^{\bar\delta}$ the complete path in $G$ such that
$\bar\delta \in T_z\pi(\xi^{\bar\delta})$. We will say that a branching program $G$ solves the problem $z$
if for each row $\bar\delta \in T_z$, the terminal node of the path $\xi^{\bar\delta}$ is labeled with the
number $\nu(\bar\delta)$. Let $P$ be a probability distribution for the problem $z$. The value
$h_\Psi(G, P, z) = \sum_{\bar\delta \in T_z} \Psi(\xi^{\bar\delta}) P(\bar\delta) / N(T_z, P)$ is called the $P$-average weighted depth
of the branching program $G$. A branching program $G$ for the problem $z$ that
solves $z$ and has the minimum $P$-average weighted depth is called optimal
for $\Psi$, $z$ and $P$.
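As a hedged illustration of the $P$-average weighted depth, the sketch below evaluates it for a tiny branching program. The dictionary encoding of the program, the helper names `run`, `solves` and `average_weighted_depth`, and the 0-based attribute indices are assumptions of this example, not notation from the text.

```python
def run(program, root, row, weight):
    """Follow the complete path of `row`; return (path weight, terminal label)."""
    total = 0
    node = program[root]
    while node[0] == "test":
        _, i, child0, child1 = node
        total += weight[i]                      # add the weight of attribute i
        node = program[child1 if row[i] == 1 else child0]
    return total, node[1]


def solves(program, root, rows, nu, weight):
    """Check that every row ends in a terminal node labeled with its decision."""
    return all(run(program, root, r, weight)[1] == nu[r] for r in rows)


def average_weighted_depth(program, root, rows, prob, weight):
    """Sum of path weight times P(row), divided by the total mass N(T_z, P)."""
    mass = sum(prob[r] for r in rows)
    return sum(run(program, root, r, weight)[0] * prob[r] for r in rows) / mass


# Conjunction of two attributes, uniform distribution, unit weights.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
nu = {r: r[0] & r[1] for r in rows}
prob = {r: 1 for r in rows}
weight = [1, 1]
program = {
    "root": ("test", 0, "zero", "second"),
    "second": ("test", 1, "zero", "one"),
    "zero": ("leaf", 0),
    "one": ("leaf", 1),
}
```

Two rows stop after one test and two rows need both tests, so the average weighted depth of this program is 1.5.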

Theorem 3.4. Let $U$ be a 2-valued information system, $\Psi$ a weight function
for $U$, $z$ a problem over $U$, and $P$ a probability distribution for $z$. Let $G$ be a
branching program for $z$ that solves $z$ and is optimal for $\Psi$, $z$ and $P$. Then
$G$ is a read-once branching program.

Proof. For an arbitrary node $v$, we will call the $v$-subprogram of the branching
program $G$ the set of nodes and edges of $G$ to which an oriented path from
$v$ exists. Let $v$ be a node such that for each node $w$ of the $v$-subprogram, each
path from the root of $G$ to $w$ contains $v$. Let $v$ have $k > 1$ incoming edges
$r_1, \ldots, r_k$. For $i = 2, \ldots, k$, let us add to the program $G$ a subprogram $G_i$
that coincides with the $v$-subprogram, and transform $G$ so that the edge $r_i$ enters the
root of the subprogram $G_i$. Let us repeat this transformation until at most
one edge enters each node in $G$. Denote the resulting graph by $\Gamma$. One can see
that $\Gamma$ is a decision tree for the problem $z$ that solves $z$ and is optimal for
$\Psi$, $z$ and $P$.

Assume $G$ is not a read-once branching program. Then there is a complete
path $\xi$ in $\Gamma$ containing two nonterminal nodes $v_1$ and $v_2$ which are assigned
the same attribute $f$. Let $v_1$ precede $v_2$ in the path $\xi$. Denote by $e$ the edge
that leaves $v_1$ and is contained in the path $\xi$, and by $\sigma$ the number assigned to
$e$. Denote by $\xi_2$ the path from the root of $\Gamma$ to the node $v_2$. One can see that
either $T_z\pi(\xi_2) = \emptyset$ or $T_z\pi(\xi_2)(f, \delta) = T_z\pi(\xi_2)$ for $\delta = \sigma$. Then the node
$v_2$ is not essential and the tree $\Gamma$ is not reduced. According to Lemma 2.3,
the tree $\Gamma$ is not optimal for $\Psi$, $z$ and $P$. Then the branching program $G$ is
not optimal for $\Psi$, $z$ and $P$, which contradicts the premise of the theorem and
thus concludes the proof. □

Let us conclude with some examples of problems for which the minimum number
of nodes in a branching program with the minimum average weighted
depth grows exponentially with the number of attributes. For an arbitrary
Boolean function $f(x_1, \ldots, x_n)$, we will say that a branching program implements
$f$ if it solves the problem $z = (f, x_1, \ldots, x_n)$.

In [66], it is shown that a read-once branching program implementing the
function $Mult : \{0, 1\}^{2n} \to \{0, 1\}$ (the middle bit in the product of two
$n$-bit integers) contains at least $2^{\Omega(\sqrt{n})}$ nodes. In [83, 84, 85], a function
$n/2$-Clique-Only $: \{0, 1\}^{n^2} \to \{0, 1\}$ is considered that takes as input an
incidence matrix for a graph with $n$ nodes. The function takes the value 1 if
and only if the graph contains an $n/2$-clique and does not contain other edges.
It is shown that a read-once branching program implementing the function
$n/2$-Clique-Only contains at least $2^{\Omega(n)}$ nodes. Note that there is a branching
program implementing $n/2$-Clique-Only that has $O(n^3)$ nodes and in which
any attribute appears at most twice in each complete path. In [55], it
is shown that a read-once branching program implementing the characteristic
function of Bose-Chaudhuri codes contains at least $\exp(\Omega(\sqrt{n}/2))$ nodes.
Theorem 3.4 shows that the branching programs that are optimal relative to
the average weighted depth have at least as many nodes as
the read-once branching programs with the minimum number of nodes.



Chapter 4 

Algorithms for Decision Tree Construction



The study of algorithms for decision tree construction was initiated in the 1960s.
The first algorithms are based on the separation heuristic [13 HI] that at 
each step tries dividing the set of objects as evenly as possible. Later Garey 
and Graham [1^ showed that such algorithm may construct decision trees 
whose average depth is arbitrarily far from the minimum. Hyafil and Rivest in
[55] proved NP-hardness of the DT problem, that is, constructing a tree with the
minimum average depth for a diagnostic problem over a 2-valued information
system and uniform probability distribution. Cox et al. in [22] showed that
for a two-class problem over an information system, even finding the root node
attribute for an optimal tree is an NP-hard problem.

Several exact algorithms of decision tree construction are known but, as 
could be expected, none of them have polynomial time complexity in gen- 
eral case. The algorithms based on dynamic programming [371 EHl ES] build 
decision tree bottom-up by synthesizing a tree for a table from trees for its 
separable subtables. The algorithms based on branch-and-bound technique 
perform depth- first search in the space of possible tree prefixes [S] [73] . The 
second method is more complex from the computational point of view, but it 
can serve as a base for approximation algorithms that use heuristics to guide 
search. A combination of the two approaches is described in [42] . There are 
also algorithms that use logic methods to analyze the function being implemented,
such as finding function implicants [11] or T-terms [82]. A comprehensive
survey of the algorithms can be found in ^J . 

Most approximate algorithms for decision tree construction are greedy.
These algorithms construct trees in a top-down fashion by minimizing some 
data impurity function at each step. Activity of a variable [13], entropy 
[701 I7H] and Gini index [5] are widely used as data impurity functions. For 
some problems, a detailed analysis of the existence of algorithms with a guaranteed
approximation ratio has been performed. Adler and Heeringa [32]
proved the absence of a polynomial-time approximation scheme for the DT problem
unless P = NP and described an algorithm that has a $(\ln n + 1)$ approximation
ratio. Chakaravarthy et al. JJJ generalized the results to k-DT, that is,
construction of a decision tree with the minimum average depth for a diagnostic
problem over a k-valued information system and an arbitrary probability
distribution. They proved NP-hardness of $\Omega(\log n)$ approximation and described
an algorithm that has an $O(\log k \log n)$ approximation ratio. A similar
problem, building a tree with the minimum average depth for a binary classification
problem over a 2-valued information system and uniform probability
distribution, is surprisingly harder. In [32], an approximation-preserving reduction
of the problem to ConDT is given, where ConDT is building the minimum size
tree for a binary classification problem over a 2-valued information system. For
the latter problem, Alekhnovich et al. [3] proved the absence of a polynomial-time
$c \ln n$-approximation for any constant $c$ unless $NP \subseteq DTIME[2^{n^{\varepsilon}}]$ for some
$\varepsilon < 1$.

The chapter is devoted to theoretical and experimental study of several 
exact and approximate algorithms for decision tree construction. It consists 
of four sections. The first section describes an algorithm $\mathcal{A}$ based on dynamic
programming. The idea is close to [42], but it was devised by the author
independently in collaboration with Dr. Moshkov. The algorithm takes as
input a decision table and finds the set of all so-called irredundant decision 
trees that have the minimum average weighted depth. The second section ex- 
perimentally estimates the approximation ratio of several greedy algorithms 
on data sets from UCI Machine Learning Repository [5S]. The third section 
describes using $\mathcal{A}$ for calculating exact values of the Shannon type function
$\mathcal{H}(n)$ for the class of monotone Boolean functions for small $n$. The fourth
section contains experimental results of applying $\mathcal{A}$ for building an optimal
tree for corner point detection [74], a technique used in computer vision to
track objects.

Some results of this chapter have been published in [3D1 [211 El] . 

4.1 Algorithm $\mathcal{A}$ for Decision Tree Construction

In this section, an algorithm is considered that builds an optimal decision
tree with the minimum average weighted depth for a problem represented in
the form of a decision table. The idea of the algorithm is based on dynamic
programming [27, 42, 60, 76].

4.1.1 Representation of Set of Irredundant Decision Trees

Let $U = (A, F)$ be an information system, and $z = (\nu, f_1, \ldots, f_n)$ a problem
over $U$. Let $T$ be a separable subtable of $T_z$. For $i \in \{1, \ldots, n\}$, denote by $E(T, i)$
the set of numbers contained in the $i$-th column of the table $T$, and denote
$E(T) = \{i : i \in \{1, \ldots, n\}, |E(T, i)| \ge 2\}$.
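The two sets just introduced are easy to compute directly. The following is a small sketch under the assumption that a table is a collection of equal-length tuples; the function names `column_values` and `splitting_attributes` are invented for this example, and columns are indexed from 0 rather than from 1.

```python
def column_values(table, i):
    """E(T, i): the set of numbers that occur in column i of the table."""
    return {row[i] for row in table}


def splitting_attributes(table, n):
    """E(T): indices of columns that contain at least two different numbers."""
    return {i for i in range(n) if len(column_values(table, i)) >= 2}


# Column 0 is constant, column 1 takes both values, column 2 is constant.
table = [(0, 0, 1), (0, 1, 1)]
```

For this table, only column 1 can split the rows, so `splitting_attributes(table, 3)` is `{1}`.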

Among decision trees for the problem $z$ that solve $z$ we distinguish irredundant
decision trees. A decision tree $\Gamma$ is irredundant if the following conditions
hold. Consider an arbitrary node $w$ of the tree $\Gamma$ and its
corresponding separable subtable $T = T_z\pi(path(\Gamma, w))$. Let $T$ be a terminal
subtable, and $\nu(x) \equiv r$ on the set of rows of the table $T$ for some $r \in \omega$. Then
$w$ is a terminal node labeled with $r$. Let $T$ be a nonterminal subtable. Then
$w$ is labeled with an attribute $f_i$ where $i \in E(T)$. Finally, each node $w$ such
that $T_z\pi(path(\Gamma, w)) = \emptyset$ is labeled with the number 0.

The following proposition shows that among irredundant decision trees,
at least one has the minimum average weighted depth.

Proposition 4.1. Let $U$ be an information system, $\Psi$ a weight function for
$U$, $z$ a problem over $U$, and $P$ a probability distribution for $z$. Then there
exists an irredundant decision tree that is optimal for $\Psi$, $z$ and $P$.

Proof. Let $\Gamma$ be a decision tree for the problem $z$ that solves $z$, and let $\Gamma$ be
optimal for $\Psi$, $z$ and $P$. Let us consider an algorithm that transforms $\Gamma$ into
an irredundant decision tree. The algorithm sequentially processes all nodes
of the tree $\Gamma$. Let $w$ be the current node. Denote $T = T_z\pi(path(\Gamma, w))$. The
algorithm tries to apply the following rules to each node.

• If $T = \emptyset$, then replace the subtree whose root is $w$ with a single node
labeled with 0;

• If $T$ is a terminal subtable and $\nu(x) \equiv r$ on the set of rows of the table
$T$, then replace the subtree whose root is $w$ with a single node labeled
with $r$;

• Let $T$ be a nonterminal subtable and $w$ be labeled with an attribute $f_i$,
$i \notin E(T)$. Then $E(T, i) = \{\delta\}$ for some $\delta \in E_k$. Denote by $\Gamma_\delta$ the decision
tree whose root is entered by the edge leaving $w$ and labeled with $\delta$. Then replace
the subtree whose root is $w$ with $\Gamma_\delta$.

Since each node is considered at most once, the algorithm ends in a finite
number of steps. Denote the resulting decision tree by $\Gamma'$. One can see that $\Gamma'$ is
an irredundant decision tree for $z$. Obviously, the applied transformation does
not increase the complexity and thus $\Gamma'$ remains optimal. □

Let $T$ be a separable subtable of $T_z$. Let us define a problem $z_T$ corresponding
to the table $T$. Let $T_z = \{\bar\delta_1, \ldots, \bar\delta_s\}$ and $Q_1, \ldots, Q_s \subseteq A$ be
equivalence classes such that $(f_1(q_i), \ldots, f_n(q_i)) = \bar\delta_i$ for any $q_i \in Q_i$. Let
$T = \{\bar\delta_{i_1}, \ldots, \bar\delta_{i_m}\}$. Then $z_T$ is the problem with the same description as $z$
over the information system $(Q_{i_1} \cup \ldots \cup Q_{i_m}, F)$. Note that $z_{T_z}$ is the initial
problem.

Denote by $Tree^*(T)$ the set of irredundant decision trees for the problem
$z_T$. Assume technically that for $T = \emptyset$, the set $Tree^*(T)$ contains a single
tree that is a node labeled with the number 0. Consider an algorithm $\mathcal{B}$
for construction of the graph $\Delta(z)$, which represents in some sense the set
$Tree^*(T_z)$. Nodes of this graph are some separable subtables of the table $T_z$.
During each step, the algorithm processes exactly one node and marks this
node with the symbol *. The algorithm starts with the graph which consists
of one node $T_z$, and finishes when all nodes of the graph are processed.

Let the algorithm have performed $p$ steps. Describe the step $(p+1)$. If in the
considered graph all nodes have already been processed, then the algorithm
finishes, and the considered graph is $\Delta(z)$. Let the graph have an unprocessed
node (table) $T$. If $T$ is a terminal subtable and $\nu(x) \equiv r$ on the set of rows of
the table $T$, then label the considered node with the number $r$, mark it with
the symbol * and pass to the step $(p + 2)$.

Let $T$ be a nonterminal subtable. For each $i \in E(T)$, draw from the node
$T$ a bundle of edges. Let $E(T, i) = \{\delta_1, \ldots, \delta_t\}$. Then draw $t$ edges from $T$,
and label these edges with the pairs $(f_i, \delta_1), \ldots, (f_i, \delta_t)$ respectively. These
edges enter the nodes $T(f_i, \delta_1), \ldots, T(f_i, \delta_t)$. If some of these nodes are not
in the graph, then add these nodes to the graph. The algorithm marks the
node $T$ with the symbol * and proceeds to the step $(p + 2)$.
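The step-by-step description above can be sketched in Python as follows. This is an illustrative reading, not the author's implementation: subtables are encoded as frozensets of rows, the graph is a dictionary keyed by subtable, and the name `build_graph`, the tags "terminal" and "nonterminal", and the 0-based attribute indices are all assumptions of this sketch.

```python
def build_graph(rows, nu, n):
    """Build the graph of separable subtables reachable from the full table.

    Each node maps to ("terminal", r) or to ("nonterminal", bundles), where
    bundles[i][delta] is the subtable entered by the edge labeled (f_i, delta).
    """
    root = frozenset(rows)
    graph = {}
    pending = [root]                      # nodes not yet processed
    while pending:
        table = pending.pop()
        if table in graph:                # already processed (marked with *)
            continue
        decisions = {nu[r] for r in table}
        if len(decisions) == 1:           # terminal subtable: constant decision
            graph[table] = ("terminal", decisions.pop())
            continue
        bundles = {}
        for i in range(n):
            values = {r[i] for r in table}
            if len(values) < 2:           # i is not in E(T): no bundle
                continue
            bundles[i] = {
                d: frozenset(r for r in table if r[i] == d) for d in values
            }
            pending.extend(bundles[i].values())
        graph[table] = ("nonterminal", bundles)
    return root, graph


# Conjunction of two attributes over all four rows.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
nu = {r: r[0] & r[1] for r in rows}
root, graph = build_graph(rows, nu, 2)
```

For this example the graph has eight nodes, three of which are nonterminal: the full table and the two subtables in which one attribute is fixed to 1.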








Fig. 4.1 Trivial decision tree

Fig. 4.2 Aggregated decision tree

Now for each node of the graph $\Delta(z)$, we describe the set of decision trees
corresponding to it. It is clear that $\Delta(z)$ is a directed acyclic graph. A graph
node is called terminal if it does not have outgoing edges. We will "move"
from terminal nodes, which are labeled with numbers, to the node $T_z$. Let $T$
be a node which is labeled with the number $r$. Then the only trivial decision
tree depicted in Fig. 4.1 corresponds to the considered node. Let $T$ be a
node (table) such that $\nu(x) \not\equiv const$ on the set of rows of $T$. Let $i \in E(T)$,
$E(T, i) = \{\delta_1, \ldots, \delta_t\}$, and $E_k \setminus E(T, i) = \{\gamma_1, \ldots, \gamma_{k-t}\}$. Let $\Gamma_1, \ldots, \Gamma_t$ be
decision trees from the sets corresponding to the nodes $T(f_i, \delta_1), \ldots, T(f_i, \delta_t)$.
Then the decision tree depicted in Fig. 4.2 belongs to the set of decision
trees which corresponds to the node $T$. All such decision trees belong to the
considered set. This set does not contain any other decision trees.

For any node $T$, denote by $Tree(T)$ the set of decision trees corresponding
to $T$ described by the graph $\Delta(z)$. The following proposition shows that $\Delta(z)$
represents all irredundant decision trees for the problem $z$.

Proposition 4.2. Let $U$ be an information system, and $z$ a problem over $U$.
Let $T$ be a node in the graph $\Delta(z)$. Then $Tree(T) = Tree^*(T)$.

Proof. Prove the proposition by induction on the nodes of the graph $\Delta(z)$. For
each terminal node $T$, there is only one irredundant decision tree, depicted in
Fig. 4.1, and the set $Tree(T)$ contains only this tree. Let $T$ be a nonterminal
node and the proposition hold for all its descendants. Consider an arbitrary
decision tree $\Gamma \in Tree(T)$. Obviously, $\Gamma$ contains more than one node. Let
the root of $\Gamma$ be labeled with an attribute $f_i$. For each $\delta \in E_k$, denote by $\Gamma_\delta$
the decision tree connected to the root of $\Gamma$ with the edge labeled with the
number $\delta$. From the definition of the set $Tree(T)$ it follows that $i$ is contained
in the set $E(T)$; for each $\delta \in E(T, i)$, the decision tree $\Gamma_\delta$ belongs to the set
$Tree(T(f_i, \delta))$; and for each $\delta \notin E(T, i)$, the decision tree $\Gamma_\delta$ is a single node
labeled with the number 0. According to the induction hypothesis, the tree $\Gamma_\delta$ is
an irredundant decision tree for the problem $z_{T(f_i, \delta)}$. Then the tree $\Gamma$ is an
irredundant decision tree for the problem $z_T$, so $Tree(T) \subseteq Tree^*(T)$.

Now consider an arbitrary irredundant decision tree $\Gamma$ for the problem
$z_T$. According to the definition of an irredundant tree, the root of $\Gamma$ is labeled
with an attribute $f_i$, $i \in E(T)$, and the subtrees whose roots are nodes of the
second level are irredundant decision trees for the corresponding descendants
of the node $T$. Then according to the definition of the set $Tree(T)$, the tree
$\Gamma$ belongs to $Tree(T)$, and $Tree(T) = Tree^*(T)$. □

The following proposition gives upper and lower bounds on the time complexity
of the algorithm $\mathcal{B}$ (further we assume that $k$ is fixed and do not study
the dependence of the algorithm time complexity on $k$).

Proposition 4.3. For an arbitrary problem $z = (\nu, f_1, \ldots, f_n)$ represented
in the form of a decision table $T_z$, the working time of the algorithm $\mathcal{B}$ is
proportional to the number of rows $D(T_z)$ if $T_z$ is a terminal table. If $T_z$ is a
nonterminal table, the working time of the algorithm $\mathcal{B}$ is bounded from below
by the maximum of the values $n$, $|S(z)|$, $D(T_z)$ and is bounded from above by
a polynomial on these values.

Proof. At the start, the algorithm $\mathcal{B}$ reads $D(T_z)$ values of the function $\nu(x)$ to check
if $T_z$ is a terminal table. If the table $T_z$ is terminal, the algorithm builds
the graph $\Delta(z)$ that consists of a single node and finishes, so the statement
obviously holds.

Let $T_z$ be a nonterminal table. From the definition of a problem over an
information system it follows that $|E(T_z)| = n$, so the algorithm builds $n$
bundles of edges leaving the root. One can see that the algorithm $\mathcal{B}$ performs
at least $|S(z)|$ steps, so the lower bound holds.

The number of steps of the algorithm $\mathcal{B}$ is limited from above by the
number of nonterminal subtables and their immediate descendants in the
graph $\Delta(z)$, which is at most $|S(z)|nk$. For a table $T$, construction of the sets
$E(T)$ and $E(T, i)$ takes time linear in the length of the table representation,
i.e. $nD(T)$. It is easy to implement a procedure which, given a subtable $T$,
checks if the corresponding node is present in the graph, and has a polynomial
time complexity on $|S(z)|$ and $D(T)$. While processing a nonterminal table
$T$, the algorithm needs to build a set of subtables of the form $T(f_i, \sigma)$, which
can be done in polynomial time on $n$ and $D(T)$.

Then the total working time of the algorithm is bounded from above by a
polynomial on $|S(z)|$, $n$ and $D(T_z)$. □

4.1.2 Procedure of Optimization

Let us describe a procedure which transforms the graph $\Delta(z)$ into its proper
subgraph $\Delta_{\Psi,P}(z)$. It begins from the terminal nodes and moves to the node
$T_z$. The procedure assigns a number to each node and possibly removes some
bundles of edges which start in the considered node. First, the number 0 is
assigned to each terminal node. Consider a node $T$ which is not terminal and
a bundle of edges which starts in this node. Let the edges be labeled with
pairs $(f_i, \delta_1), \ldots, (f_i, \delta_t)$, and let them enter the nodes $T(f_i, \delta_1), \ldots, T(f_i, \delta_t)$ to
which numbers $p_1, \ldots, p_t$ have been already assigned. Then assign the number
$\Psi(f_i) N(T, P) + \sum_{j=1}^{t} p_j$ to the considered bundle.

Let $p$ be the minimum of the numbers assigned to the bundles starting in
$T$. The procedure assigns $p$ to the node $T$ and removes the bundles starting
in $T$ which are assigned numbers greater than $p$. After all nodes are
processed, the procedure removes from the graph all nodes such that there
is no directed path from the node $T_z$ to the considered node. Denote the
resulting graph by $\Delta_{\Psi,P}(z)$.
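The procedure can be sketched as a memoized recursion over subtables. This is one hedged way to realize the same bottom-up computation, not the author's code: the name `optimize`, the frozenset encoding of subtables, and the 0-based attribute indices are assumptions. The sketch records the surviving bundles for every nonterminal subtable and omits the final removal of unreachable nodes. Exact arithmetic (integers or fractions) is assumed for weights and probabilities so that comparing bundle numbers for equality is safe.

```python
from functools import lru_cache


def optimize(rows, nu, prob, weight):
    """Return (number assigned to the root, surviving bundles per subtable)."""
    n = len(rows[0])
    kept = {}   # subtable -> attribute indices whose bundles are not removed

    @lru_cache(maxsize=None)
    def number(table):
        if len({nu[r] for r in table}) == 1:
            return 0                          # terminal nodes get the number 0
        mass = sum(prob[r] for r in table)    # N(T, P)
        cost = {}
        for i in range(n):
            values = {r[i] for r in table}
            if len(values) >= 2:              # i belongs to E(T)
                cost[i] = weight[i] * mass + sum(
                    number(frozenset(r for r in table if r[i] == d))
                    for d in values
                )
        best = min(cost.values())
        kept[table] = [i for i, c in cost.items() if c == best]
        return best

    root = frozenset(rows)
    return number(root), kept


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
prob = {r: 1 for r in rows}
weight = [1, 1]

# Decision equal to the first attribute: testing the second one is wasteful.
nu_first = {r: r[0] for r in rows}
p_first, kept_first = optimize(rows, nu_first, prob, weight)

# Conjunction: both attributes are equally good at the root.
nu_and = {r: r[0] & r[1] for r in rows}
p_and, kept_and = optimize(rows, nu_and, prob, weight)
```

Dividing the number assigned to the root by $N(T_z, P) = 4$ gives the minimum average weighted depth: 4/4 = 1 in the first example and 6/4 = 1.5 in the second.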

As each nonterminal node keeps at least one bundle of outgoing edges,
all terminal nodes of $\Delta_{\Psi,P}(z)$ are terminal nodes of the graph $\Delta(z)$. As
described earlier, we can put a set of decision trees
$Tree_{\Psi,P}(T)$ in correspondence with each node $T$ of $\Delta_{\Psi,P}(z)$. One can see that all these
decision trees belong to the set $Tree(T)$. Denote by $Tree^*_{\Psi,P}(T)$ the subset of
$Tree(T)$ containing all decision trees that are optimal relative to the average
weighted depth, i.e. $Tree^*_{\Psi,P}(T) = \{\Gamma \in Tree(T) : h_\Psi(\Gamma, z_T, P) =
\min_{\Gamma' \in Tree(T)} h_\Psi(\Gamma', z_T, P)\}$. The following theorem shows that the optimization
procedure removes all and only non-optimal decision trees.

Theorem 4.1. Let $U$ be an information system, $\Psi$ a weight function, $z$ a
problem over $U$, and $P$ a probability distribution for $z$. Let $T$ be an arbitrary
node in the graph $\Delta(z)$. Then $Tree_{\Psi,P}(T) = Tree^*_{\Psi,P}(T)$.

We preface the proof of the theorem with the following lemma.

Lemma 4.1. Let $U$ be an information system, $\Psi$ a weight function, $z$ a problem
over $U$, and $P$ a probability distribution for $z$. Let $T$ be an arbitrary node
in the graph $\Delta(z)$, and $p$ the number assigned to $T$ by the optimization
procedure. Then for each decision tree $\Gamma$ from the set $Tree_{\Psi,P}(T)$, the equality
$p = N(T, P) h_\Psi(\Gamma, z_T, P)$ holds.

Proof. Prove the lemma by induction on the nodes of $\Delta(z)$. For each terminal
node $T$, only one irredundant decision tree $\Gamma$ exists, depicted in Fig. 4.1, and
the statement of the lemma obviously holds for $T$. Let now $T$ be a nonterminal
node and the statement of the lemma hold for all descendants of $T$. Consider
an arbitrary decision tree $\Gamma \in Tree_{\Psi,P}(T)$. Let the root of $\Gamma$ be labeled with
an attribute $f_i$. Let $E(T, i) = \{a_1, \ldots, a_t\}$. For $j = 1, \ldots, t$, denote by $\Gamma_j$ the
decision tree connected to the root of $\Gamma$ with the edge labeled with $a_j$. For
$j = 1, \ldots, t$, let the node $T(f_i, a_j)$ be labeled with a number $p_j$. For $j = 1, \ldots, t$,
denote $z_j = z_{T(f_i, a_j)}$.

The induction hypothesis implies that the equality $p_j = N(T(f_i, a_j), P) h_\Psi(\Gamma_j,
z_j, P)$ holds for $j = 1, \ldots, t$. According to the definition of the optimization
procedure, $p = \Psi(f_i) N(T, P) + \sum_{j=1}^{t} p_j$. Since $\Gamma$ is an irredundant decision
tree, for any $\delta \notin E(T, i)$, the edge that leaves the root of $\Gamma$ and is labeled
with $\delta$ enters a terminal node.

From the definition of the average weighted depth we have

$h_\Psi(\Gamma, z_T, P) = \Psi(f_i) + \frac{1}{N(T, P)} \sum_{j=1}^{t} N(T(f_i, a_j), P) h_\Psi(\Gamma_j, z_j, P)$ . (4.1)

From the last three equalities we have $p = N(T, P) h_\Psi(\Gamma, z_T, P)$. Since $\Gamma$ is
an arbitrary tree from $Tree_{\Psi,P}(T)$, all the trees in $Tree_{\Psi,P}(T)$ have the same
complexity equal to $p$. □



Proof of Theorem 4.1. The theorem will be proved by induction on the
nodes of the graph $\Delta(z)$. Let $T$ be a terminal node. Then the set $Tree(T)$
contains only the tree depicted in Fig. 4.1, and this tree is not removed by
the optimization procedure. Then the statement of the theorem holds for the
node $T$.

Let now $T$ be a nonterminal node in $\Delta(z)$, and the statement of the theorem
hold for any descendant of $T$ in the graph $\Delta(z)$. Let the optimization
procedure assign a number $p$ to the node $T$. Lemma 4.1 implies that all
decision trees in $Tree_{\Psi,P}(T)$ have the same complexity $p$. Consider an arbitrary
decision tree $\Gamma$ from the set $Tree^*_{\Psi,P}(T)$. From the definition of the set
$Tree^*_{\Psi,P}(T)$ we have

$N(T, P) h_\Psi(\Gamma, z_T, P) \le p$ . (4.2)

Let us show that $N(T, P) h_\Psi(\Gamma, z_T, P) \ge p$. Let the root of $\Gamma$ be assigned
an attribute $f_i$. Since $\Gamma$ is an irredundant decision tree, $i \in E(T)$.
Let $E(T, i) = \{a_1, \ldots, a_t\}$. For $j = 1, \ldots, t$, denote by $\Gamma_j$ the subtree
that is connected to the root with the edge labeled with $a_j$. One can see
that $\Gamma_j$ is contained in the set $Tree(T(f_i, a_j))$. Let $p_j$ be the number assigned
to the node $T(f_i, a_j)$ during optimization. For $j = 1, \ldots, t$, denote
$z_j = z_{T(f_i, a_j)}$. Since the theorem holds for the node $T(f_i, a_j)$, we have
$N(T(f_i, a_j), P) h_\Psi(\Gamma_j, z_j, P) \ge p_j$. From the description of the optimization
process it follows that $\Psi(f_i) N(T, P) + \sum_{j=1}^{t} p_j \ge p$. Since $\Gamma$ is an irredundant
decision tree, for any $\delta \notin E(T, i)$, the edge that leaves the root of $\Gamma$ and is
labeled with $\delta$ enters a terminal node.

From the last two inequalities and (4.1) we have $N(T, P) h_\Psi(\Gamma, z_T, P) \ge p$,
and, recalling (4.2), $N(T, P) h_\Psi(\Gamma, z_T, P) = p$. Then

$Tree_{\Psi,P}(T) \subseteq Tree^*_{\Psi,P}(T)$ . (4.3)

Due to (4.1), optimality of the tree $\Gamma$ implies optimality of each tree $\Gamma_j$,
so $\Gamma_j \in Tree^*_{\Psi,P}(T(f_i, a_j))$ for $j = 1, \ldots, t$. Then, according to the induction
hypothesis, $\Gamma_j$ belongs to the set $Tree_{\Psi,P}(T(f_i, a_j))$ for $j = 1, \ldots, t$. Consider the
bundle of edges in the graph $\Delta(z)$ that leave the node $T$ and are labeled with
the pairs $(f_i, a_1), \ldots, (f_i, a_t)$. Since $N(T, P) h_\Psi(\Gamma, z_T, P) = p$, these edges
were not removed by the optimization procedure. Then, according to the
definition of the set $Tree_{\Psi,P}(T)$, the tree $\Gamma$ belongs to this set. As $\Gamma$ was
chosen arbitrarily, we have $Tree^*_{\Psi,P}(T) \subseteq Tree_{\Psi,P}(T)$, and due to (4.3),
$Tree^*_{\Psi,P}(T) = Tree_{\Psi,P}(T)$. □

Consider an algorithm $\mathcal{A}$ that, given a table $T_z$, first builds the graph $\Delta(z)$,
then transforms it into the graph $\Delta_{\Psi,P}(z)$ and finally extracts one of the trees
described by the graph $\Delta_{\Psi,P}(z)$. For an arbitrary polynomial $Q$, a probability
distribution $P$ is called $Q$-restricted if for an arbitrary row $\bar\delta \in T_z$, the length
of the binary notation of the number $P(\bar\delta)$ does not exceed $Q(n)$, where $n$ is
the number of columns in the table. The following theorem characterizes the
time complexity of the algorithm $\mathcal{A}$.

Theorem 4.2. Let $Q(x)$ be some polynomial. Then for an arbitrary problem
$z = (\nu, f_1, \ldots, f_n)$ and an arbitrary $Q$-restricted probability distribution $P$
for the problem $z$, the working time of the algorithm $\mathcal{A}$ is proportional to
the number of rows $D(T_z)$ if the table $T_z$ is terminal. If the table $T_z$ is
nonterminal, the working time of the algorithm $\mathcal{A}$ is bounded from below by
the maximum of the values $n$, the number of nonterminal separable subtables
$|S(z)|$, $D(T_z)$ and the maximum length of attribute weight in binary notation,
and is bounded from above by a polynomial on these values.

Proof. If the table $T_z$ is terminal, the working time of the algorithm $\mathcal{B}$ is
proportional to $D(T_z)$ according to Proposition 4.3. Since the graph $\Delta(z)$
contains a single node, the remaining steps of the algorithm $\mathcal{A}$ are completed
in constant time, so the statement of the theorem holds.

Let $T_z$ be a nonterminal table. While calculating the number to assign to
the node $T_z$ in the graph $\Delta(z)$, the optimization procedure necessarily reads
the weights of all attributes. This fact and Proposition 4.3 prove the lower
bound on the working time of the algorithm $\mathcal{A}$.

Let us prove the upper bound. From Proposition 4.3 it follows that the
working time of the algorithm $\mathcal{B}$ is limited from above by a polynomial on
$D(T_z)$, $n$ and $|S(z)|$. The optimization procedure performs exactly $(|S(z)| + 1)$
steps. The time of computing $p_i$ is limited from above by a polynomial on
the maximum length of the attribute weight notation (denote it by $l$), $Q(n)$
and $D(T)$. The time of computing $p$ given $p_i$ is proportional to the number of
bundles, which is at most $n$. Given the graph $\Delta_{\Psi,P}(z)$, an optimal tree can be
obtained in time proportional to the number of nodes in the graph, which
is limited from above by a polynomial on $n$ and $|S(z)|$. Then the theorem
statement is a consequence of the facts that both the number of steps and
the complexity of each step are bounded from above by a polynomial on $n$,
$|S(z)|$, $D(T_z)$ and $l$. □



4.2 Greedy Algorithms 

A greedy algorithm builds a decision tree in a top-down fashion, minimizing
some impurity criterion at each step. There are several impurity criteria
based on information-theoretical [711 ES|, statistical [Hj and combinatorial
[43] approaches. In this section, several impurity criteria are defined, followed
by a general description of the greedy algorithm and experimental results that
compare the average depth of decision trees built by different algorithms.

Let $U = (A, F)$ be an information system, $\Psi$ a weight function for $U$,
$z = (\nu, f_1, \ldots, f_n)$ a problem over $U$, and $P$ a probability distribution for
$z$.

Let $T$ be a separable subtable of $T_z$. Let the function $\nu(x)$ take $l$ different
values $\nu_1, \ldots, \nu_l$ on the rows of $T$. For $i = 1, \ldots, l$, denote $N_i =
\sum_{\bar\delta \in T, \nu(\bar\delta) = \nu_i} P(\bar\delta)$. We consider four uncertainty measures:

• entropy: $ent(T) = -\sum_{i=1}^{l} (N_i / N(T, P)) \log_2 (N_i / N(T, P))$ (we assume
$0 \log_2 0 = 0$);

• Gini index: $gini(T) = 1 - \sum_{i=1}^{l} (N_i / N(T, P))^2$;

• misclassification error: $me(T) = 1 - \max_{i=1,\ldots,l} N_i / N(T, P)$;

• weighted number of unordered pairs of rows labeled with different decisions:
$rt(T) = (N(T, P)^2 - \sum_{i=1}^{l} N_i^2) / 2$ (note that $rt(T) = N(T, P)^2
\times gini(T) / 2$).
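The relation between $rt$ and $gini$ noted in the last item follows directly from the two definitions. A short derivation, using only the symbols $N_i$ and $N(T, P)$ introduced above, is:

```latex
\begin{align*}
rt(T) &= \frac{1}{2}\Bigl(N(T,P)^2 - \sum_{i=1}^{l} N_i^2\Bigr)
       = \frac{N(T,P)^2}{2}\Bigl(1 - \sum_{i=1}^{l}\Bigl(\frac{N_i}{N(T,P)}\Bigr)^{2}\Bigr)
       = \frac{N(T,P)^2}{2}\, gini(T), \\
rt(T) &= \sum_{1 \le i < j \le l} N_i N_j
       \quad\text{since}\quad
       N(T,P)^2 = \Bigl(\sum_{i=1}^{l} N_i\Bigr)^{2}
                = \sum_{i=1}^{l} N_i^2 + 2\sum_{1 \le i < j \le l} N_i N_j .
\end{align*}
```

The second line makes the name of the measure explicit: each unordered pair of rows with different decisions contributes the product of its probabilities.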

Let $i \in E(T)$ and $E(T, i) = \{a_1, \ldots, a_t\}$. The attribute $f_i$ divides the table
$T$ into the subtables $T_1 = T(f_i, a_1), \ldots, T_t = T(f_i, a_t)$. We now define
an impurity function $I$ which assigns impurity $I(T, f_i)$ to this partition.
Let us fix an uncertainty measure $U$ from the set $\{ent, gini, me, rt\}$
and the type of impurity function: sum or weighted-sum. Then for the type
sum, $I(T, f_i) = \sum_{j=1}^{t} U(T_j)$, and for the type weighted-sum, $I(T, f_i) =
\sum_{j=1}^{t} U(T_j) N(T_j) / N(T)$. As a result, we have eight different impurity
functions.
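The eight impurity functions can be sketched compactly in Python. The function names, the representation of a subtable as a list of rows, and the reading of $N(T_j)/N(T)$ as the ratio of total probabilities $N(T_j, P)/N(T, P)$ are assumptions of this sketch; attributes are indexed from 0.

```python
from math import log2


def class_masses(table, nu, prob):
    """The numbers N_i: total probability of the rows for each decision."""
    masses = {}
    for r in table:
        masses[nu[r]] = masses.get(nu[r], 0) + prob[r]
    return list(masses.values())


def ent(m):
    total = sum(m)
    return -sum(x / total * log2(x / total) for x in m if x > 0)


def gini(m):
    total = sum(m)
    return 1 - sum((x / total) ** 2 for x in m)


def me(m):
    return 1 - max(m) / sum(m)


def rt(m):
    return (sum(m) ** 2 - sum(x * x for x in m)) / 2


def impurity(table, i, nu, prob, measure, weighted=False):
    """I(T, f_i) of type sum, or of type weighted-sum if `weighted` is set."""
    total = sum(prob[r] for r in table)
    result = 0
    for d in {r[i] for r in table}:
        part = [r for r in table if r[i] == d]          # subtable T(f_i, d)
        u = measure(class_masses(part, nu, prob))
        if weighted:
            u = u * sum(prob[r] for r in part) / total
        result += u
    return result


# Conjunction of two attributes, uniform distribution.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
nu = {r: r[0] & r[1] for r in rows}
prob = {r: 1 for r in rows}
```

Splitting this table on the first attribute gives one pure subtable and one evenly mixed subtable, so the entropy impurity is 1 for the type sum and 0.5 for the type weighted-sum.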

Consider an algorithm $G$ that, given a representation of a problem $z$ and a probability distribution $P$ in the form of a decision table $T_z$, builds a decision tree $G(z,P)$.

Step 1. Assume $T = T_z$. Build a decision tree that contains a single node $v$.

Let $T$ be a terminal table. Then assign the number $\nu(\delta)$ to the node $v$, where $\delta$ is an arbitrary row from $T$. Denote by $G(z,P)$ the resulting decision tree. The process is completed.

Let $T$ be a nonterminal table. Assign the word $\lambda$ to the node $v$ and proceed to the next step.

Let $t \ge 1$ steps have been already done. Denote by $\Gamma$ the tree built at the step $t$.

Step $(t + 1)$. If none of the nodes in $\Gamma$ is assigned with a word from $\Omega_z^*$, then denote by $G(z, P)$ the tree $\Gamma$. The process is completed. Otherwise, choose in $\Gamma$ a node $w$ which is assigned with a word $\alpha$ from $\Omega_z^*$.

Let $T\alpha$ be a terminal subtable. If $T\alpha = \emptyset$, then instead of $\alpha$ assign to $w$ the number 0. If $T\alpha \ne \emptyset$, then instead of $\alpha$ assign to $w$ the number $\nu(\delta)$ where $\delta$ is an arbitrary row from $T\alpha$. Proceed to the step $(t + 2)$.

Let $T\alpha$ be a nonterminal subtable. Then for each $f_i \in E(T\alpha)$, compute the value $I(T\alpha, f_i)$ and assign to the node $w$ the attribute $f_{i_0}$ where $i_0$ is the minimum $i \in \{1, \ldots, n\}$ for which $I(T\alpha, f_i)$ has the minimum value. For each $\delta \in E(T\alpha, f_{i_0})$, add to the tree $\Gamma$ the node $w(\delta)$, mark this node with the word $\alpha(f_{i_0}, \delta)$, draw the edge from $w$ to $w(\delta)$, and mark this edge with $\delta$. Proceed to the step $(t + 2)$.
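The greedy procedure can be sketched recursively. The following Python code is a simplified illustration rather than the step-by-step process above: it does not build the tree explicitly but only accumulates the average depth, it uses the weighted-sum impurity with the Gini index, and it assumes a consistent table (rows with equal attribute values carry equal decisions). The row format `(attribute_tuple, decision, weight)` and all function names are assumptions of the example.

```python
from collections import defaultdict
from itertools import product

def total(table):
    """N(T, P): total weight of the rows of the table."""
    return sum(p for _, _, p in table)

def gini(table):
    by_class = defaultdict(float)
    for _, decision, p in table:
        by_class[decision] += p
    n = total(table)
    return 1 - sum((w / n) ** 2 for w in by_class.values())

def split(table, i):
    """Subtables T(f_i, a) for every value a of attribute i on the table."""
    parts = defaultdict(list)
    for row in table:
        parts[row[0][i]].append(row)
    return parts

def impurity(table, i, measure):
    """Weighted-sum impurity of the partition induced by attribute i."""
    n = total(table)
    return sum(measure(t) * total(t) / n for t in split(table, i).values())

def depth_sum(table, measure):
    """Sum over rows of weight * depth in the greedy tree."""
    if len({decision for _, decision, _ in table}) <= 1:
        return 0.0  # terminal table: a single leaf, no attribute is tested
    n_attrs = len(table[0][0])
    # E(T): attributes that are not constant on the table.
    candidates = [i for i in range(n_attrs) if len(split(table, i)) > 1]
    # Minimum impurity; ties are resolved in favor of the minimum index.
    best = min(candidates, key=lambda i: (impurity(table, i, measure), i))
    parts = split(table, best).values()
    return total(table) + sum(depth_sum(t, measure) for t in parts)

def greedy_average_depth(table, measure=gini):
    return depth_sum(table, measure) / total(table)

def truth_table(f, n):
    """Decision table of a Boolean function under the uniform distribution."""
    return [(x, f(x), 1.0) for x in product((0, 1), repeat=n)]

print(greedy_average_depth(truth_table(lambda x: int(any(x)), 2)))
print(greedy_average_depth(truth_table(lambda x: int(sum(x) >= 2), 3)))
```

For the disjunction of two variables the sketch gives average depth 1.5, and for the majority function of three variables it gives 2.5.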

Different impurity functions result in different greedy algorithms. The fol- 
lowing experiment compares the average depth of decision trees built by these 
algorithms with the minimum average depth calculated by the algorithm A. 

Table 4.1 Average depth of decision trees built by different algorithms

Data set          Min. avg.  sum                         weighted sum
                  depth      ent    gini   rt     me     ent    gini   rt     me
adult-stretch     1.50       1.50   1.50   3.50   1.50   1.50   1.50   3.50   1.50
agaricus-lepiota  1.52       2.35   2.35   1.54   1.52   1.52   1.52   1.52   1.98
balance-scale     3.55       3.55   3.55   3.61   3.55   3.55   3.55   3.61   3.55
breast-cancer     3.24       6.36   6.36   4.06   3.30   3.49   3.70   3.30   3.35
cars              2.95       3.06   3.06   3.72   3.76   2.95   2.96   4.00   4.39
flags             2.72       9.31   9.73   3.21   2.81   3.16   3.16   2.82   2.80
hayes-roth-data   2.62       2.64   2.64   2.64   2.62   2.64   2.64   2.62   2.62
house-votes-84    3.54       5.88   6.99   5.29   3.77   3.68   3.80   3.77   3.63
lenses            1.80       1.80   1.80   3.00   3.00   3.00   1.80   3.00   3.00
lymphography      2.67       7.09   7.09   3.37   2.83   3.12   3.12   2.79   2.78
monks-1-test      2.50       4.50   4.50   2.50   2.50   2.50   2.50   2.50   2.50
monks-1-train     2.53       4.34   4.34   2.53   2.77   3.19   3.22   2.53   2.53
monks-2-test      5.30       5.33   5.33   5.37   5.54   5.40   5.40   5.54   5.54
monks-2-train     4.11       4.70   4.70   4.54   4.20   4.34   4.34   4.26   4.28
monks-3-test      1.83       4.11   2.78   2.78   1.83   1.83   2.08   1.83   1.83
monks-3-train     2.51       3.76   3.03   2.71   2.53   2.54   2.54   2.53   2.53
nursery           3.45       4.05   4.21   3.76   3.76   3.48   3.46   3.85   4.18
poker-hand-train  4.09       6.54   6.54   4.66   4.12   4.12   4.12   4.12   4.13
shuttle-landing   2.33       3.93   3.93   2.93   2.33   2.40   2.40   2.33   2.33
soybean-small     1.34       1.34   1.34   1.34   1.34   1.34   1.34   1.34   1.89
spect-test        2.95       5.93   5.55   4.93   3.48   3.04   3.34   3.47   3.44
teeth             2.78       4.39   4.52   2.78   2.83   2.83   2.78   2.83   2.83
tic-tac-toe       4.35       4.88   4.68   4.82   4.94   4.60   4.58   5.03   5.11
zoo-data          2.29       3.86   3.86   2.44   2.37   2.37   2.37   2.37   2.41
ARD                          0.564  0.539  0.222  0.070  0.066  0.052  0.126  0.121

The data sets were taken from the UCI Machine Learning Repository [25]. Each data set is represented as a table containing several input columns and an output (decision) column. Some data sets contain index columns that take a unique value for each row. Such columns were removed. Some tables contain rows with identical values in all columns except, possibly, the decision column. In this case, each group of identical rows was replaced with a single row with the common values in all input columns and the most common value in the decision column. Some tables contain missing values. Each missing value was replaced with the most common value in the corresponding column.

The resulting table was interpreted as a decision table $T_z$ where input columns represent attribute values and the output column represents values of the function $\nu(x)$. We assume the uniform probability distribution $P(x) \equiv 1$. As an integral performance measure we consider the average relative deviation (ARD). For an approximate algorithm $X$, a set of problems $Z = \{z_1, \ldots, z_t\}$, and a set of probability distributions $\mathcal{P} = \{P_1, \ldots, P_t\}$,

$$ARD = \frac{1}{t} \sum_{i=1}^{t} \frac{h_X(z_i, P_i) - h(z_i, P_i)}{h(z_i, P_i)} ,$$

where $h_X(z_i, P_i)$ is the $P_i$-average depth of the decision tree for $z_i$ built by $X$. We assume that none of the tables $T_{z_i}$ are terminal, so $h(z_i, P_i) > 0$ for $i = 1, \ldots, t$.
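The average relative deviation can be computed as the mean of the relative excess of the approximate average depth over the minimum one. The sketch below assumes this reading of the definition; the function name `ard` and the sample numbers are hypothetical and are not taken from the tables.

```python
def ard(optimal, approximate):
    """Average relative deviation of approximate depths from optimal depths.

    optimal     -- minimum average depths h(z_i, P_i), all positive
    approximate -- average depths h_X(z_i, P_i) of the trees built by X
    """
    if not optimal or len(optimal) != len(approximate):
        raise ValueError("both lists must be nonempty and of equal length")
    deviations = [(hx - h) / h for h, hx in zip(optimal, approximate)]
    return sum(deviations) / len(deviations)

# Hypothetical depths for two problems.
print(ard([2.0, 4.0], [3.0, 5.0]))
```

In the hypothetical example the relative deviations are 0.5 and 0.25, so the ARD is 0.375.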

Table 4.1 shows results of experiments with 24 data sets. Each row contains the data set name, the minimum average depth of decision tree calculated by the algorithm $A$, and the average depth of decision trees built by each of the eight greedy algorithms. The last row shows the average relative deviation for the greedy algorithms. One can see that combining the weighted sum with the Gini index (the criterion used by CART [8]) or with entropy (the criterion used by ID3 [70]) results in the least ARD values.

4.3 Modeling Monotonic Boolean Functions by 
Decision Trees 

The property of the algorithm $A$ to build optimal decision trees can be used to find exact values of the Shannon-type function $H_B(n)$ described in Sect. 3.2 for small $n$. In this section, an experiment is described that calculates $H_B(n)$ for monotone functions depending on up to six variables. The number of monotone functions of $n$ arguments (also known as the Dedekind number $M(n)$) is a rapidly growing sequence. The second column of Table 4.2 contains the number $M(n)$ for $n = 1, \ldots, 6$. Using the algorithm $A$, we built optimal decision trees for all functions and thus calculated the value of $H_B(n)$ that is given in the third column. The fourth and fifth columns contain the lower and the upper bound on $H_B(n)$ which are given by Proposition [



Table 4.2 $H_B(n)$ and its bounds for the class of monotone functions

n   M(n)      H_B(n)   Lower bound   Upper bound
1   3         1        0.59          1
2   6         1.5      1.27          1.5
3   20        2.5      2             2.5
4   168       3.125    2.76          3.5
5   7581      4.125    3.55          4.5
6   7828354   4.8125   4.35          5.63



The experiments revealed that the minimum average depth reaches its maximum on threshold functions described in Sect. 3.1.3. For odd $n$, the only function having $h(f) = H_B(n)$ is $Thr_{n,(n+1)/2}$. For even $n$, there are two such functions: $Thr_{n,n/2}$ and $Thr_{n,n/2+1}$. Note that all these functions are $\alpha$-functions, so the obtained value of $H_B(n)$ is the same for the classes $M_1, M_2, M_3, M_4$. The function $Thr_{n,(n+1)/2}$ is a self-dual function for odd $n$, so the obtained values of $H_B(n)$, $n = 1, 3, 5$, are applicable to the class $D_2$. The experiment also makes it possible to find the distribution of the minimum average depth among functions in $M_1$ of $n$ variables. Fig. 4.3 and Fig. 4.4 show histograms of $h(f)$ for monotone functions of five and six variables.
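For very small $n$ the experiment of this section can be reproduced by exhaustive search. The following sketch is not the algorithm $A$: it is a plain memoized recursion over partial assignments, assuming the uniform distribution on all $2^n$ tuples and unit cost per tested variable. The function names are hypothetical.

```python
from functools import lru_cache
from itertools import product

def min_avg_depth(table, n):
    """Minimum average depth of a decision tree computing the function.

    table -- truth table indexed by x, where bit i of x is the i-th variable
    """
    @lru_cache(maxsize=None)
    def h(mask, val):
        # mask marks the tested variables, val holds their values.
        outputs = {table[x] for x in range(1 << n) if x & mask == val}
        if len(outputs) == 1:
            return 0.0  # the function is constant on this subcube
        best = float("inf")
        for i in range(n):
            bit = 1 << i
            if mask & bit:
                continue
            # Test variable i; both outcomes are equally probable.
            cost = 1 + 0.5 * (h(mask | bit, val) + h(mask | bit, val | bit))
            best = min(best, cost)
        return best
    return h(0, 0)

def is_monotone(table, n):
    """Raising any single variable from 0 to 1 never lowers the value."""
    return all(table[x] <= table[x | (1 << i)]
               for x in range(1 << n) for i in range(n) if not x & (1 << i))

def shannon_monotone(n):
    """Number of monotone functions of n variables and the largest
    minimum average depth among them."""
    funcs = [t for t in product((0, 1), repeat=1 << n) if is_monotone(t, n)]
    return len(funcs), max(min_avg_depth(t, n) for t in funcs)

for n in (1, 2, 3):
    print(n, shannon_monotone(n))
```

For $n = 1, 2, 3$ the search enumerates 3, 6 and 20 monotone functions and finds the maxima 1, 1.5 and 2.5. The enumeration of all $2^{2^n}$ truth tables makes this approach impractical beyond $n = 4$.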



4.4 Constructing Optimal Decision Trees for Corner 
Point Detection 

In this section, we consider a problem that originated in computer vision: constructing an optimal testing strategy for corner point detection by the FAST algorithm [74, 75]. The problem can be formulated as a problem of building a decision tree with the minimum average depth. We experimentally compare the performance of the algorithm $A$ and several greedy algorithms that differ in the attribute selection criterion.



4.4.1 Corner Point Detection Problem

One of the important problems considered in computer vision is object tracking: given a video stream, locate an object and determine its position in each frame. There are several approaches to object tracking. One of the accepted approaches is detecting feature points and acquiring the object position by these points. Rosten and Drummond devised the FAST algorithm [74, 75] that tracks an object by the position of its corners and proposed a simple algorithm for corner point detection. The algorithm iterates through all image pixels and detects corner points by comparing the intensity of the current pixel and surrounding pixels. In order to determine if an image pixel $a$ is a corner point, a circle of 16 pixels (a Bresenham circle of radius 3) surrounding $a$ is examined: the intensity of each pixel of the circle is compared with the intensity of $a$. The pixel $a$ is assumed to be a corner point if at least 12 contiguous pixels on the circle are all either brighter or darker than $a$ by a given threshold $\gamma$ (Fig. 4.5).

Fig. 4.5 [74] FAST feature detection in an image patch. The highlighted squares are the pixels used in the feature detection. The pixel at C is the center of a detected corner: the dashed line passes through 12 contiguous pixels that are brighter than C by more than the threshold $\gamma$.

The surrounding pixels can be tested in an arbitrary order, and the required number of tests depends on the data and the chosen order of testing. A good testing strategy can reduce the expected number of checks and thus reduce the running time of the algorithm. One can see that in order to claim the current pixel as a corner point, at least 12 checks need to be done, but some candidate pixels can be rejected after only four checks. For example, checking the circle points 1, 5, 9 and 13 allows rejecting candidates that do not have at least three out of four pixels either darker or lighter than the central pixel.

For an arbitrary pixel $a$ and for $i = 1, \ldots, 16$, denote by $\phi_i(a)$ the intensity of the $i$-th pixel in the circle surrounding $a$ (the pixel ordering is shown in Fig. 4.5) and denote by $\phi_c(a)$ the intensity of the pixel $a$. The pixel $a$ can be represented as an object that is characterized by the attributes $f_1, \ldots, f_{16}$ where

$$f_i(a) = \begin{cases} 0, & \text{if } \phi_i(a) - \phi_c(a) < -\gamma , \\ 1, & \text{if } |\phi_i(a) - \phi_c(a)| \le \gamma , \\ 2, & \text{if } \phi_i(a) - \phi_c(a) > \gamma . \end{cases}$$
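The three-valued attributes and the corner criterion (at least 12 contiguous circle pixels all brighter or all darker than the center) can be sketched as follows. The sketch tests all 16 attributes and therefore says nothing about the order of tests, which is the actual subject of this section. The function names and the sample intensities are hypothetical, and the circle is treated as cyclic, so a run may wrap around from the 16th pixel to the first.

```python
def attribute(phi_i, phi_c, gamma):
    """Value of f_i: 0 for darker, 1 for similar, 2 for brighter."""
    diff = phi_i - phi_c
    if diff < -gamma:
        return 0
    if diff > gamma:
        return 2
    return 1

def longest_circular_run(values, target):
    """Length of the longest cyclic run of `target` in `values`."""
    best = run = 0
    for v in list(values) * 2:  # doubling the list handles wrap-around
        run = run + 1 if v == target else 0
        best = max(best, run)
    return min(best, len(values))

def is_corner(attrs, min_run=12):
    """True if at least min_run contiguous attributes are all 0 or all 2."""
    return any(longest_circular_run(attrs, s) >= min_run for s in (0, 2))

# Hypothetical patch: center intensity 100, threshold 30.
circle = [150] * 11 + [100] + [150] * 4
attrs = [attribute(p, 100, 30) for p in circle]
print(attrs, is_corner(attrs))
```

In the hypothetical patch the brighter pixels form a cyclic run of length 15 (pixels 13 to 16 followed by pixels 1 to 11), so the center is reported as a corner point.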

The problem of corner point detection can be formulated as a problem $z = (\nu, f_1, \ldots, f_{16})$ over a 3-valued information system $U = (A, F)$. The set $A$ contains examples of image patches collected from the training data set, that is, a set of images or a video fragment similar to the one the algorithm will work with. The set $F$ consists of the attributes $f_1, \ldots, f_{16}$, and the function $\nu$ takes the value 1 if the given combination of attribute values corresponds to a corner point, and 0 otherwise. Using the training data set, one can also estimate a probability distribution $P$ as the cardinality of the equivalence classes obtained by partitioning $A$ with the attributes from $F$. Then a valid testing strategy can be represented by a decision tree that solves $z$, and an optimal strategy corresponds to a tree with the minimum $P$-average depth.

4.4.2 Experimental Results

Following the method proposed by the authors of the FAST algorithm, we estimated the probability distribution from the training data. For each pixel $a$ (except a 2-pixel outer boundary of each image), we calculated the tuple $(f_1(a), \ldots, f_{16}(a))$ of attribute values. Then we formed a decision table $T_z$ that contains as rows all tuples of attribute values encountered in the training data. Each row is assigned the estimated probability, that is, the number of occurrences of the corresponding tuple. We did not include in the decision table the rows that do not appear in the training data. These tuples of attribute values may occur in other images and may be misclassified, but we suppose they are less probable, so the number of misclassifications will be small and can be compensated by the sensor fusion technique on a subsequent stage of the object tracking algorithm.
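The construction of the decision table from training data amounts to counting occurrences of attribute tuples. A minimal sketch, assuming the tuples have already been computed for every pixel and that a labeling function playing the role of $\nu$ is supplied by the caller (both `build_decision_table` and `label` are hypothetical names):

```python
from collections import Counter

def build_decision_table(tuples, label):
    """Rows of the decision table as (attribute_tuple, decision, weight).

    tuples -- iterable of attribute tuples, one per training pixel
    label  -- function assigning the decision to an attribute tuple
    """
    counts = Counter(tuples)  # weight = number of occurrences of the tuple
    return [(attrs, label(attrs), n) for attrs, n in counts.items()]

# Hypothetical training data with two attributes instead of 16.
observed = [(1, 1), (2, 2), (1, 1), (0, 1)]
table = build_decision_table(observed, label=lambda t: int(t == (2, 2)))
print(table)
```

Tuples that never occur in the training data are simply absent from the table, in line with the remark above.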

We performed an experiment that compares the average depth of decision trees built by the greedy algorithms with the minimum value. For training, we took three groups of images considered in [75] named box, junk and maze and tried five values of the threshold $\gamma$: 30, 40, 50, 70 and 100. For each set of images and each threshold value, a decision table $T_z$ was constructed. Then for each decision table, decision trees were built by the algorithm $A$ and by three greedy algorithms that use a combination of the weighted-sum impurity function with the gini, ent and me uncertainty measures respectively.

Table 4.3 Characteristics of decision tables

Data set  Threshold  # of rows  # of corner points  # of nodes  # of edges  Time, s
maze      100        1343       8                   88577       363418      1
maze      70         5303       476                 1699986     9223694     59
maze      50         15198      2135                5030983     24660203    339
maze      40         27295      4165                8167123     37285264    830
maze      30         50750      9310                14278124    60596561    2404
junk      100        146
junk      70         980        8                   40045       157697      1
junk      50         3509       101                 742765      4055185     20
junk      40         8323       282                 1926830     9946555     85
junk      30         18243      847                 4379006     22110004    350
box       100        680        15                  58186       308882      1
box       70         3225       113                 918734      4876964     23
box       50         10972      546                 4059543     20901371    215
box       40         20080      1487                7075517     33320358    574
box       30         38381      4258                12404116    53458575    1660



Table 4.4 Average depth of decision trees

Data set  Threshold $\gamma$  Min. avg. depth  Uncertainty measure
                                               ent       gini      me
maze      100                 1.27327          1.38421   1.4073    1.28518
maze      70                  2.97021          3.31982   3.43315   3.32529
maze      50                  3.07119          3.25339   3.49671   3.38137
maze      40                  3.13391          3.28679   3.55746   3.45605
maze      30                  3.27496          3.4028    3.73163   3.53888
junk      70                  1.17653          1.22653   1.22653   1.22653
junk      50                  2.58393          2.73041   2.77344   2.76917
junk      40                  2.77556          3.01826   2.97345   2.98738
junk      30                  2.84794          3.04665   3.08738   3.17831
box       100                 1.26912          2.19706   2.28824   1.28235
box       70                  2.68217          2.83969   3.05891   2.8093
box       50                  3.05851          3.24426   3.34962   3.35992
box       40                  3.14631          3.43322   3.60652   3.60931
box       30                  3.27373          3.47195   3.62302   3.72645
ARD                                            0.10738   0.14914   0.07744

For each decision table, Table 4.3 cites the total number of rows and the number of corner points detected in the training data. Additionally, the table shows the number of nodes and the number of edges in the graph $\Delta(z)$, and the running time of the algorithm $A$. The results give evidence that $A$ is capable of processing a table containing up to 50,000 rows and producing a graph with more than 14,000,000 nodes. Table 4.4 gives the average depth of decision trees built by the exact algorithm and by the greedy algorithms. One can see that an optimal strategy requires fewer than four points to test on average. For each uncertainty measure $U$, the last row contains the average relative deviation of the greedy algorithm that uses $U$ in the impurity criterion.

One can see that the greedy algorithms construct decision trees with the average depth 7 to 15% greater than the minimum. The greedy algorithm that uses the uncertainty measure me has the minimum ARD. However, for smaller values of $\gamma$, there is a larger variety of data (more rows in the decision table) and the greedy algorithm that uses ent performs better (this is, in fact, the ID3 algorithm [70] applied to this problem in [75]).



Chapter 5 

Problems over Information Systems 



The problems of estimating the minimum average time complexity of decision trees and of designing efficient algorithms are complex in the general case.
The upper bounds described in Sect. 2.4.3 cannot be applied directly due
to the large computational complexity of the parameter $M(z)$. Under reasonable
assumptions about the relation of P and NP, there are no polynomial time 
algorithms with good approximation ratio [T^l [311 ■ One of the possible solu- 
tions is to consider particular classes of problems and improve the existing 
results using characteristics of the considered classes. 

We use the notion of information system to describe a class of problems. The set of objects and the set of attributes are allowed to be infinite (but countable). Among all information systems, we distinguish the restricted information systems in which any system of equations of the type "attribute = value" has an equivalent subsystem whose weight is below a predefined threshold.

The first section describes the notion of restricted information system and gives bounds on the average weighted depth of decision trees depending only on the entropy. In the second section, we prove that for a restricted information system, under reasonable assumptions about the weight function and the probability distribution, the time complexity of the algorithm $A$ is bounded from above by a polynomial in the number of attributes in the problem description. Some results of this chapter were published in [17].

5.1 On Bounds on Average Depth of Decision Trees 
Depending Only on Entropy 

Let $U = (A, F)$ be an information system and $\Psi$ be a weight function for $U$. Theorem 2.3 gives a bound on the minimum average weighted depth of decision tree for an arbitrary problem $z$ over $U$. However, efficiency of this bound is limited due to the large computational complexity of the parameter $M_\Psi(z)$. Let us consider a necessary and sufficient condition for existence of a function $\Phi$ such that $h_\Psi(z,P) \le \Phi(H(P))$ for any problem $z$ over $U$ and any probability distribution $P$ for $z$.

For an arbitrary natural number $t$, a system of equations of the form

$$\{f_1(x) = \delta_1, \ldots, f_t(x) = \delta_t\}, \tag{5.1}$$

where $f_1, \ldots, f_t \in F$ and $\delta_1, \ldots, \delta_t \in E_k$, is called a system of equations over $U$. A system of equations over $U$ is called irreducible if it does not have any proper equivalent subsystems. An information system $U$ is called $r$-restricted (restricted) if each compatible system of equations over $U$ has an equivalent subsystem that contains at most $r$ equations.

For the system of equations (5.1), the value $\sum_{i=1}^{t} \Psi(f_i)$ is called the weight of the system. An information system $U$ is called $r$-restricted (restricted) relative to $\Psi$ if each compatible system of equations over $U$ has an equivalent subsystem whose weight does not exceed $r$.

Example 5.1. Let $A = R^n$, and $F$ be a nonempty set of mappings from $R^n$ to $R$. Consider an infinite family of functions $[F] = \{sign(f + \alpha) + 1 : f \in F, \alpha \in R\}$ (note that the expression $(sign(x) + 1)$ takes the value 0 for a negative $x$, 1 for $x = 0$, and 2 for a positive $x$). If $|F| = k < \infty$, then the information system $U = (A, [F])$ is $2k$-restricted (or $2k$-restricted relative to the weight function $\Psi \equiv 1$).
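The bound $2k$ in Example 5.1 can be made plausible by a small sketch: for a single function $f$, any compatible system of equations of the form $sign(f(x) + \alpha) + 1 = \delta$ is equivalent to at most two of its equations (either one equality, or the tightest upper bound and the tightest lower bound). The representation of an equation by a pair `(alpha, delta)` and the name `reduce_system` are assumptions of the example, and the input system is assumed to be compatible.

```python
def reduce_system(equations):
    """Equivalent subsystem with at most two equations for one function f.

    equations -- pairs (alpha, delta) encoding sign(f(x) + alpha) + 1 = delta:
                 delta = 0 means f(x) < -alpha,
                 delta = 1 means f(x) = -alpha,
                 delta = 2 means f(x) > -alpha.
    The system is assumed to be compatible.
    """
    equalities = [e for e in equations if e[1] == 1]
    if equalities:
        # An equality fixes f(x); in a compatible system the rest follows.
        return [equalities[0]]
    uppers = [e for e in equations if e[1] == 0]
    lowers = [e for e in equations if e[1] == 2]
    reduced = []
    if uppers:
        # f(x) < -alpha is tightest for the largest alpha.
        reduced.append(max(uppers, key=lambda e: e[0]))
    if lowers:
        # f(x) > -alpha is tightest for the smallest alpha.
        reduced.append(min(lowers, key=lambda e: e[0]))
    return reduced

# Hypothetical system: f < 1, f < 5, f > -2, f > -3.
print(reduce_system([(-1, 0), (-5, 0), (2, 2), (3, 2)]))
```

Applying the reduction to each of the $k$ functions of $F$ separately leaves at most $2k$ equations in total.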

The following theorem gives, for an arbitrary problem over a restricted information system and an arbitrary probability distribution, an upper bound on the minimum average weighted depth of decision trees that depends only on the entropy of the probability distribution.

Theorem 5.1. Let $U$ be an information system, $\Psi$ a weight function for $U$ and $U$ be $r$-restricted relative to $\Psi$ where $r$ is some natural number. Then $h_\Psi(z,P) \le 2r(H(P) + 1)$ for an arbitrary problem $z$ over $U$ and an arbitrary probability distribution $P$ for $z$.

Proof. Let $U$ be a $k$-valued information system, $U = (A, F)$ and $z = (\nu, f_1, \ldots, f_n)$. If $z \equiv const$ on $A$, then obviously $M_\Psi(z) \le r$. Let $z \not\equiv const$ on $A$. Let us consider an arbitrary tuple $\delta = (\delta_1, \ldots, \delta_n)$ from $E_k^n$ and show that $M_\Psi(z, \delta) \le 2r$. From the definition of the parameter $M_\Psi(z, \delta)$ it follows that there exists an irreducible system of equations $S = \{f_{i_1}(x) = \delta_{i_1}, \ldots, f_{i_t}(x) = \delta_{i_t}\}$ over $z$ such that $t > 0$ and the weight of the system is $M_\Psi(z, \delta)$. Denote by $A(S)$ the set of solutions of this system on $A$. If $A(S) \ne \emptyset$, then the weight of $S$ does not exceed $r$. Let $A(S) = \emptyset$. Denote $S_1 = S \setminus \{f_{i_1}(x) = \delta_{i_1}\}$. The equality $A(S) = \emptyset$ and the fact that $S$ is irreducible imply that $S_1$ is a compatible irreducible system. Therefore, the weight of the system $S_1$ does not exceed $r$ and the weight of the system $S$ does not exceed $r + \Psi(f_{i_1})$. According to the definition of system of equations, $f_{i_1}(x) \not\equiv const$ on $A$. Consequently, there exists a number $\sigma \in E_k$ for which the set of solutions of the equation $f_{i_1}(x) = \sigma$ is a nonempty proper subset of $A$. Then the system of equations $\{f_{i_1}(x) = \sigma\}$ is irreducible and compatible, and its weight (that is equal to $\Psi(f_{i_1})$) does not exceed $r$. Therefore, the weight of the system of equations $S$ does not exceed $2r$ and the value of the parameter $M_\Psi(z)$ (as the maximum of $M_\Psi(z, \delta)$, $\delta \in E_k^n$) does not exceed $2r$. Theorem 2.3 implies that $h_\Psi(z,P) \le 2r(H(P) + 1)$. □

The following theorem shows that the conditions of Theorem 5.1 are necessary and sufficient for the existence of a linear upper bound depending only on the entropy, and that considering non-linear bounds does not extend the class of information systems that have upper bounds depending only on the entropy.

Theorem 5.2. Let $U$ be an information system that is not restricted relative to the weight function $\Psi$ for $U$. Then for an arbitrary $\varepsilon > 0$, there is no function $\Phi$ that is bounded within the interval $[0, \varepsilon]$ and satisfies the condition $h_\Psi(z,P) \le \Phi(H(P))$ for any problem $z$ over $U$ and any probability distribution $P$ for $z$.

Proof. Let $U = (A, F)$ be a $k$-valued information system. Assume that for some $\varepsilon > 0$, there exists a function $\Phi$ such that $\Phi(x) \le K$, $x \in [0, \varepsilon]$, and $h_\Psi(z,P) \le \Phi(H(P))$ for any problem $z$ over $U$ and any probability distribution $P$ for $z$. By the premise of the theorem, for each natural number $i$, there exists an irreducible compatible system of equations $S_i$ with the weight at least $i$. For $i = 1, 2, \ldots$, set into correspondence to the system of equations $S_i$ a problem $z_i$ over $U$. Let $S_i = \{f_1^i(x) = \delta_1^i, \ldots, f_{n_i}^i(x) = \delta_{n_i}^i\}$. Then $z_i = (\nu_i, f_1^i, \ldots, f_{n_i}^i)$ where $\nu_i : \{0, 1\}^{n_i} \to \omega$, and $\delta^i = (\delta_1^i, \ldots, \delta_{n_i}^i)$. Let the table $T_{z_i}$ contain $s_i$ rows. Define a probability distribution $P_i$ as follows:

$$P_i(\bar{\delta}) = \begin{cases} s_i^2 - s_i + 1, & \text{if } \bar{\delta} = \delta^i , \\ 1, & \text{if } \bar{\delta} \ne \delta^i . \end{cases}$$

Let the function $\Psi$ be not bounded. This allows for choosing the systems $S_i$ such that each of them consists of a single equation. Then $s_i \in \{2, \ldots, k\}$ for any $i \in \omega \setminus \{0\}$, and the value $\Phi(H(P_i))$ takes at most $k$ different values. Consequently, $\Phi(H(P_i))$ is bounded from above by some constant (for convenience of notation let it be equal to $K$).

Let the function $\Psi$ be bounded. This allows for choosing the systems of equations such that for any natural $i$ the system $S_i$ contains at least $i$ equations. Irreducibility of the system $S_i$ implies the inequality $s_i > i$. From the definition of entropy it follows that $H(P_i) \ge 0$ for any natural number $i$. Apply the following transformations:

$$H(P_i) = \log_2 s_i^2 - \frac{s_i^2 - s_i + 1}{s_i^2} \log_2 (s_i^2 - s_i + 1)$$

$$= \log_2 s_i^2 - \left(1 - \frac{s_i - 1}{s_i^2}\right) \left(\log_2 s_i^2 + \log_2 \left(1 - \frac{s_i - 1}{s_i^2}\right)\right)$$

$$\le -\left(1 - \frac{1}{s_i}\right) \log_2 \left(1 - \frac{1}{s_i}\right) + \frac{1}{s_i} \log_2 s_i^2 .$$

One can see that for $i \to \infty$, both summands tend to zero. Then there exists a number $i_0 \in \omega \setminus \{0\}$ such that $H(P_i) < \varepsilon$ for $i > i_0$. According to the assumption, $\Phi(H(P_i)) \le K$ for $i > i_0$.
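The behavior of $H(P_i)$ can be checked numerically. The sketch below assumes the distribution used in the proof (one row of weight $s^2 - s + 1$ and $s - 1$ rows of weight 1, so the total weight is $s^2$) and compares the entropy with the bound $-(1 - 1/s)\log_2(1 - 1/s) + (1/s)\log_2 s^2$; this form of the bound is a reading of the partly illegible last line of the transformation and should be treated as an assumption. All function names are hypothetical.

```python
from math import log2

def entropy(weights):
    """Entropy of the distribution obtained by normalizing the weights."""
    n = sum(weights)
    return -sum(w / n * log2(w / n) for w in weights if w > 0)

def h_direct(s):
    """Entropy of P_i computed from the weights of the s rows."""
    return entropy([s * s - s + 1] + [1] * (s - 1))

def h_closed(s):
    """Closed form: log2(s^2) - ((s^2 - s + 1) / s^2) * log2(s^2 - s + 1)."""
    n, w = s * s, s * s - s + 1
    return log2(n) - w / n * log2(w)

def upper_bound(s):
    """Assumed form of the bound from the last line of the transformation."""
    return -(1 - 1 / s) * log2(1 - 1 / s) + log2(s * s) / s

for s in (2, 10, 100, 1000):
    print(s, h_direct(s), upper_bound(s))
```

The printed values decrease as $s$ grows, which illustrates why $H(P_i)$ eventually falls below any fixed $\varepsilon > 0$.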

Let $\Gamma_i$ be a decision tree for the problem $z_i$ that solves $z_i$, and $\Gamma_i$ be optimal for $\Psi$, $z_i$ and $P_i$. Then there exists a complete path $\xi_i$ in $\Gamma_i$ such that $\delta^i \in T_{z_i}\pi(\xi_i)$. Since for an arbitrary row $\bar{d} \in T_{z_i}$, $\bar{d} \ne \delta^i$, the relation $\nu_i(\bar{d}) \ne \nu_i(\delta^i)$ holds, the subtable $T_{z_i}\pi(\xi_i)$ does not contain other rows except $\delta^i$. Irreducibility of the system $S_i$ implies $\Psi(\xi_i) \ge i$, and the nonterminal nodes of the path $\xi_i$ are assigned with all attributes from the set $\{f_1^i, \ldots, f_{n_i}^i\}$. Using the definition of the average weighted depth, we obtain

$$h_\Psi(\Gamma_i, P_i) \ge \frac{s_i^2 - s_i + 1}{s_i^2}\, i \ge \frac{i}{2} .$$

Taking into account that the tree $\Gamma_i$ is optimal for $\Psi$, $z_i$ and $P_i$, we have $h_\Psi(z_i, P_i) \ge i/2$. Therefore, there exists a number $i_1 \in \omega \setminus \{0\}$ such that $h_\Psi(z_i, P_i) > K$ for $i > i_1$. Then for any number $i^* > \max(i_0, i_1)$, the inequality $h_\Psi(z_{i^*}, P_{i^*}) > \Phi(H(P_{i^*}))$ holds. Consequently, the considered assumption is wrong. □

5.2 Polynomiality Criterion for Algorithm A

Let $U = (A, f_1, f_2, \ldots)$ be an infinite information system and $\Psi$ a weight function for $U$. Denote by $Z(U)$ the set of problems over the information system $U$. For an arbitrary problem $z$, denote by $\dim z$ the number of attributes listed in the description of $z$.
Consider the functions

$$S_U(n) = \max\{|S(z)| : z \in Z(U), \dim z \le n\}$$

and

$$D_U(n) = \max\{D(T_z) : z \in Z(U), \dim z \le n\}$$

that characterize the dependence of the maximum number of separable subtables and the maximum number of rows on the number of columns in decision tables over $U$.

Let $\Psi$ be bounded from above by some constant, and $Q(x)$ be some polynomial. Theorem 4.2 implies that for an arbitrary problem $z$ over $U$ and an arbitrary $Q$-restricted probability distribution for the problem $z$, the time complexity of the algorithm $A$ is bounded from above by a polynomial in the number of attributes in the problem description if the functions $S_U(n)$ and $D_U(n)$ are bounded from above by a polynomial in $n$. Also, one can see that the time complexity of the algorithm $A$ has an exponential lower bound if the function $S_U(n)$ grows exponentially.

Theorem 5.3. Let $U = (A, F)$ be a $k$-valued information system. Then the following statements hold:

a) if $U$ is $r$-restricted, then $S_U(n) \le (nk)^r + 1$ and $D_U(n) \le (nk)^r + 1$ for any natural number $n$;

b) if $U$ is not restricted, then $S_U(n) \ge 2^n - 1$ for any natural number $n$.

Proof. a) Let $U$ be $r$-restricted and $z = (\nu, f_1, \ldots, f_n) \in Z(U)$. One can see that the values $|S(T_z)|$ and $D(T_z)$ do not exceed the number of pairwise nonequivalent compatible subsystems of the system of equations $\{f_1(x) = 0, \ldots, f_n(x) = 0, \ldots, f_1(x) = k - 1, \ldots, f_n(x) = k - 1\}$ including the empty system (assume the set of solutions of the empty system to be equal to $A$). Since the information system $U$ is $r$-restricted, each compatible system of equations over $U$ contains an equivalent subsystem of at most $r$ equations. Then $|S(T_z)| \le (\dim z)^r k^r + 1$ and $D(T_z) \le (\dim z)^r k^r + 1$. Therefore, $S_U(n) \le (nk)^r + 1$ and $D_U(n) \le (nk)^r + 1$.

b) Let $U$ be not restricted and $n$ be a natural number. Then there exists an irreducible system of equations over $U$ containing at least $n$ equations. Since each of its subsystems is irreducible, there exists an irreducible system over $U$ that consists of $n$ equations. Let it be the system (5.1) with $t = n$, which will be denoted by $W$. Let us show that any two different subsystems $W_1$ and $W_2$ of $W$ are not equivalent. Assume the contrary. Then the systems $W \setminus (W_1 \setminus W_2)$ and $W \setminus (W_2 \setminus W_1)$ are equivalent to $W$, and at least one of them is a proper subsystem of $W$, which is impossible. Consider a diagnostic problem $z = (\nu, f_1, \ldots, f_n)$. One can set into correspondence to a proper subsystem $\{f_{i_1}(x) = \delta_{i_1}, \ldots, f_{i_t}(x) = \delta_{i_t}\}$ of the system $W$ the separable subtable $T_z(f_{i_1}, \delta_{i_1}) \ldots (f_{i_t}, \delta_{i_t})$ of the table $T_z$. Since any two different subsystems are nonequivalent to each other and to the system $W$, the subtables corresponding to these subsystems are different and nonterminal. Then $|S(T_z)| \ge 2^n - 1$. Therefore, $S_U(n) \ge 2^n - 1$. □



Conclusions 



The monograph considers several aspects of the problem of constructing 
decision trees with the minimum time complexity. A known bound on the 
minimum average depth for a problem with a complete set of attributes is 
generalized in two ways. First, a bound for an arbitrary problem is obtained 
that depends on the parameter M{z). Second, a class of restricted informa- 
tion systems is described; so all problems over a restricted information system 
have a common bound depending only on the entropy. A necessary condition 
for the problem decomposition is described that might be too restrictive for 
using in applications, but it works in constructive proofs. 

An exact algorithm $A$ for construction of decision trees has been studied both theoretically and experimentally. The experimental results described in Sect. 4.4.2 show that $A$ is capable of processing a table with 16 attributes and more than 50000 rows. A class of all information systems was described for which the algorithm has polynomial time complexity on the decision table size. The algorithm allows further optimization by using branch and bound methods to reduce the search space. Current parallel computing environments enable an effective implementation of such algorithms and make them applicable to practical problems described by decision tables of a moderate size.



Appendix A 

Closed Classes of Boolean Functions 



The lattice of all classes of Boolean functions closed relative to the operation 
of substitution has been described by Post in [H71 [HH] • In [35] , Yablonskii, 
Gavrilov and Kudriavtzev considered the structure of all classes of Boolean 
functions closed relative to the operation of substitution and the operations 
of insertion and deletion of inessential variable. Appendix contains the de- 
scription of this structure that is slightly different from Post's lattice. The 
text of Appendix is close to the text of Appendix in [^ . 

A.1 Some Definitions and Notation

Let $U$ be a set of Boolean functions, $f(x_1, \ldots, x_n)$ be a function from $U$, and $g_i$ be either a function from $U$ or a variable, $i = 1, \ldots, n$. We will say that the function $f(g_1, \ldots, g_n)$ is obtained from functions from $U$ by the operation of substitution.

Let $f(x_1, \ldots, x_n)$ be a Boolean function. A variable $x_i$ of the function $f$ is essential if there exist two $n$-tuples $\delta$ and $\sigma$ from $E_2^n$ that differ only in the $i$-th digit and for which $f(\delta) \ne f(\sigma)$. The variables of the function $f$ that are not essential are called inessential variables. Let $x_j$ be an inessential variable of the function $f$ and $g(x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n) = f(x_1, \ldots, x_{j-1}, 0, x_{j+1}, \ldots, x_n)$. We will say that the function $g$ is obtained from $f$ by the operation of deletion of inessential variable. We will say that the function $f$ is obtained from $g$ by the operation of insertion of inessential variable.

Let $U$ be a nonempty set of Boolean functions. We denote by $[U]$ the closure of the set $U$ relative to the operation of substitution and the operations of insertion and deletion of inessential variable. The set $U$ is called a closed class if $U = [U]$.

The notion of a formula over $U$ is defined inductively in the following way:

a) The expression $f(x_1, \ldots, x_n)$, where $f(x_1, \ldots, x_n)$ is a function from $U$, is a formula over $U$.

b) Let $f(x_1, \ldots, x_n)$ be a function from $U$ and $\varphi_1, \ldots, \varphi_n$ be expressions that are either formulas over $U$ or variables. Then the expression $f(\varphi_1, \ldots, \varphi_n)$ is a formula over $U$.

A Boolean function corresponds in a natural way to any formula over $U$. We will say that the formula realizes this Boolean function.

Denote by $[U]_1$ the closure of the set $U$ relative to the operation of substitution. One can show that $[U]_1$ coincides with the set of functions realized by formulas over $U$. Denote by $[U]_2$ the closure of the set $[U]_1$ relative to the operations of insertion and deletion of inessential variable. One can show that $[U] = [U]_2$.

We denote the logical negation operation by $\neg$ and the modulo 2 summation
by $\oplus$. For a natural $n$ and $t \in E_2$, denote by $\tilde{t}_n$ the $n$-tuple
$(t, t, \ldots, t) \in E_2^n$. Let $f$ be a Boolean function depending on $n$ variables.
The function $f$ is called an $\alpha$-function if $f(\tilde{t}_n) = t$ for any $t \in E_2$, a $\beta$-function if
$f(\tilde{t}_n) = 1$ for any $t \in E_2$, and a $\gamma$-function if $f(\tilde{t}_n) = 0$ for any $t \in E_2$.
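Whether a function is an $\alpha$-, $\beta$-, or $\gamma$-function depends only on its values on the two constant tuples. A minimal sketch, assuming functions are Python callables on 0/1 arguments (the name `kind` and the sample functions are hypothetical):

```python
def kind(f, n):
    """Classify f by its values on the tuples (0,...,0) and (1,...,1)."""
    at_zeros = f(*([0] * n))
    at_ones = f(*([1] * n))
    if (at_zeros, at_ones) == (0, 1):
        return "alpha"   # f(t,...,t) = t for t = 0, 1
    if (at_zeros, at_ones) == (1, 1):
        return "beta"    # f(t,...,t) = 1 for t = 0, 1
    if (at_zeros, at_ones) == (0, 0):
        return "gamma"   # f(t,...,t) = 0 for t = 0, 1
    return None          # f(0,...,0) = 1 and f(1,...,1) = 0: none of the three

print(kind(lambda x, y: x & y, 2))      # alpha
print(kind(lambda x, y: x ^ y ^ 1, 2))  # beta
print(kind(lambda x, y: x ^ y, 2))      # gamma
print(kind(lambda x: 1 - x, 1))         # None
```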

A function $f$ is called a linear function if $f = c_0 \oplus c_1 x_1 \oplus \ldots \oplus c_n x_n$ where
$c_i \in E_2$, $0 \le i \le n$. A function $f$ is called a self-dual function if
$f(x_1, \ldots, x_n) = \neg f(\neg x_1, \ldots, \neg x_n)$. A function $f$ is called a monotone function if for any
$n$-tuples $\tilde{\delta} = (\delta_1, \ldots, \delta_n)$ and $\tilde{\sigma} = (\sigma_1, \ldots, \sigma_n)$ from $E_2^n$ such that $\delta_i \le \sigma_i$,
$1 \le i \le n$, the inequality $f(\tilde{\delta}) \le f(\tilde{\sigma})$ holds.
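The three properties can be tested exhaustively for small $n$. The sketch below is one possible brute-force check, again assuming functions are Python callables on 0/1 arguments; the predicate names are hypothetical.

```python
from itertools import product

def is_self_dual(f, n):
    """f(x1,...,xn) equals NOT f(NOT x1,...,NOT xn) on every tuple."""
    return all(f(*t) == 1 - f(*(1 - v for v in t))
               for t in product((0, 1), repeat=n))

def is_monotone(f, n):
    """f(d) <= f(s) whenever d <= s componentwise."""
    tuples = list(product((0, 1), repeat=n))
    return all(f(*d) <= f(*s)
               for d in tuples for s in tuples
               if all(a <= b for a, b in zip(d, s)))

def is_linear(f, n):
    """f coincides with c0 + c1*x1 + ... + cn*xn modulo 2."""
    c0 = f(*([0] * n))
    # The coefficient c_i is forced by the value of f on the i-th unit tuple.
    c = [c0 ^ f(*[1 if k == i else 0 for k in range(n)]) for i in range(n)]
    return all(f(*t) == (c0 + sum(ci & ti for ci, ti in zip(c, t))) % 2
               for t in product((0, 1), repeat=n))

maj = lambda x, y, z: (x & y) | (x & z) | (y & z)
xor3 = lambda x, y, z: x ^ y ^ z
print(is_self_dual(maj, 3), is_monotone(maj, 3), is_linear(maj, 3))     # True True False
print(is_self_dual(xor3, 3), is_monotone(xor3, 3), is_linear(xor3, 3))  # True False True
```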

Let $\mu \in \omega \setminus \{0, 1\}$. We will say that a function $f(x_1, \ldots, x_n)$ satisfies the
condition $\langle a^\mu \rangle$ if for any $\mu$ tuples from $E_2^n$ on which $f$ takes the value 0 there
exists a number $j \in \{1, \ldots, n\}$ such that in each of the considered tuples the
$j$-th digit is equal to 0. We will say that the function $f$ satisfies the condition
$\langle a^\infty \rangle$ if there exists a number $j \in \{1, \ldots, n\}$ such that in any $n$-tuple from
$E_2^n$ on which $f$ takes the value 0 the $j$-th digit is equal to 0. We will say that
the function $f$ satisfies the condition $\langle A^\mu \rangle$ if for any $\mu$ tuples from $E_2^n$ on
which $f$ takes the value 1 there exists a number $j \in \{1, \ldots, n\}$ such that in
each of the considered tuples the $j$-th digit is equal to 1. We will say that a
function $f$ satisfies the condition $\langle A^\infty \rangle$ if there exists a number $j \in \{1, \ldots, n\}$
such that in any $n$-tuple from $E_2^n$ on which $f$ takes the value 1 the $j$-th digit
is equal to 1. The constant 1, by definition, satisfies the condition $\langle a^\infty \rangle$ and
does not satisfy the condition $\langle A^\mu \rangle$. The constant 0, by definition, satisfies
the condition $\langle A^\infty \rangle$ and does not satisfy the condition $\langle a^\mu \rangle$.
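For functions of a few variables, the four conditions can be checked directly from the definitions. The sketch below is a brute-force check under two assumptions that the text does not state: the $\mu$ tuples are allowed to coincide, and $n \ge 1$ (the constants, which the text treats by definition, are not handled). All function names are hypothetical.

```python
from itertools import combinations_with_replacement, product

def _points(f, n, value):
    """All n-tuples on which f takes the given value."""
    return [t for t in product((0, 1), repeat=n) if f(*t) == value]

def _common_digit(f, n, mu, value):
    """Any mu tuples with f = value share a position holding that value."""
    return all(any(all(t[j] == value for t in group) for j in range(n))
               for group in combinations_with_replacement(_points(f, n, value), mu))

def satisfies_a(f, n, mu):   # condition <a^mu>
    return _common_digit(f, n, mu, 0)

def satisfies_A(f, n, mu):   # condition <A^mu>
    return _common_digit(f, n, mu, 1)

def satisfies_a_inf(f, n):   # condition <a^infinity>
    zeros = _points(f, n, 0)
    return any(all(t[j] == 0 for t in zeros) for j in range(n))

def satisfies_A_inf(f, n):   # condition <A^infinity>
    ones = _points(f, n, 1)
    return any(all(t[j] == 1 for t in ones) for j in range(n))

g = lambda x, y, z: x | (y & z)                    # every zero of g has x = 0
maj = lambda x, y, z: (x & y) | (x & z) | (y & z)
print(satisfies_a_inf(g, 3), satisfies_A(g, 3, 2))          # True False
print(satisfies_a(maj, 3, 2), satisfies_a(maj, 3, 3))       # True False
```

The majority function shows that the conditions become strictly stronger as $\mu$ grows: any two of its zeros share a zero position, but the three zeros (1,0,0), (0,1,0), (0,0,1) do not.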

Let $\mu \in \omega \setminus \{0, 1\}$. Denote

$$h_\mu = \bigvee_{i=1}^{\mu+1} (x_1 \wedge x_2 \wedge \ldots \wedge x_{i-1} \wedge x_{i+1} \wedge \ldots \wedge x_{\mu+1})$$

and

$$h_\mu^* = \bigwedge_{i=1}^{\mu+1} (x_1 \vee x_2 \vee \ldots \vee x_{i-1} \vee x_{i+1} \vee \ldots \vee x_{\mu+1}) .$$
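Both functions depend on $\mu + 1$ variables and have a simple threshold reading: $h_\mu$ equals 1 exactly when at least $\mu$ of the variables are 1, and $h_\mu^*$ equals 1 exactly when at least two of them are 1. The sketch below builds both functions literally from the definitions and compares them with this reading for small $\mu$; the builder names `h` and `h_star` are hypothetical.

```python
from itertools import product

def h(mu):
    """h_mu: disjunction over i of the conjunction of all x_k with k != i."""
    m = mu + 1
    def f(*x):
        return int(any(all(x[k] for k in range(m) if k != i) for i in range(m)))
    return f

def h_star(mu):
    """h*_mu: conjunction over i of the disjunction of all x_k with k != i."""
    m = mu + 1
    def f(*x):
        return int(all(any(x[k] for k in range(m) if k != i) for i in range(m)))
    return f

# Compare with the threshold reading on every tuple for mu = 2, 3, 4.
for mu in (2, 3, 4):
    for t in product((0, 1), repeat=mu + 1):
        assert h(mu)(*t) == int(sum(t) >= mu)
        assert h_star(mu)(*t) == int(sum(t) >= 2)
print("threshold reading confirmed for mu = 2, 3, 4")
```

For $\mu = 2$ both functions coincide with the majority function of three variables.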

A.2 Description of All Closed Classes of Boolean Functions

In this subsection, all closed classes of Boolean functions are listed. For each 
class the Post notation is given, the description of functions contained in 
the considered class is presented, and a finite set of Boolean functions is 
given such that its closure relative to the operation of substitution and the 
operations of insertion and deletion of inessential variable is equal to this 
class. 
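For a small generating set, the part of its closure that consists of functions of two fixed variables can be computed by iterating substitution on truth tables until nothing new appears. The following sketch does this under an assumed encoding: each generator is a pair (callable, arity), and a function of $x$ and $y$ is stored as the tuple of its values on (0,0), (0,1), (1,0), (1,1). The name `closure_two_vars` is hypothetical, and the computation covers only this two-variable slice of the closure.

```python
from itertools import product

POINTS = list(product((0, 1), repeat=2))  # all values of the pair (x, y)

def closure_two_vars(generators):
    """Truth tables over (x, y) of functions realized by formulas over the generators."""
    # The variables x and y may be arguments of formulas but are not formulas themselves.
    atoms = {tuple(p[0] for p in POINTS), tuple(p[1] for p in POINTS)}
    realized = set()
    changed = True
    while changed:
        changed = False
        pool = atoms | realized
        for f, arity in generators:
            for args in product(pool, repeat=arity):
                # Apply f pointwise to the truth tables of its arguments.
                table = tuple(f(*(a[k] for a in args)) for k in range(len(POINTS)))
                if table not in realized:
                    realized.add(table)
                    changed = True
    return realized

nand = lambda x, y: 1 - (x & y)
neg = lambda x: 1 - x
zero = lambda: 0
disj = lambda x, y: x | y

print(len(closure_two_vars([(nand, 2)])))            # 16: every function of x and y
print(len(closure_two_vars([(neg, 1), (zero, 0)])))  # 6: 0, 1, x, NOT x, y, NOT y
print(len(closure_two_vars([(disj, 2)])))            # 3: x, y, x OR y
```

Storing functions as truth tables over the same pair of variables makes insertion and deletion of inessential variable implicit: a function that ignores $y$ is simply a table that does not depend on the second coordinate.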

As in [36], two Boolean functions are called equal if one of them can
be obtained from the other by the operations of insertion and deletion of
inessential variable.

The inclusion diagram for closed classes of Boolean functions [3S] is depicted
in Fig. A.1. Each closed class is represented by a dot. The dots
corresponding to certain classes $U$ and $V$ are connected with an edge if $V$ is
immediately included into $U$ (there are no intermediate classes between $U$
and $V$); in this case, the dot corresponding to the outer class $U$ is placed
higher on the diagram.

1. The class $O_1 = [\{x\}]$. This class consists of all
functions equal to the function $x$, and all functions obtained from them by
renaming of variables without identification.

2. The class $O_2 = [\{1\}]$. This class consists of all functions equal to the
function 1.

3. The class $O_3 = [\{0\}]$. This class consists of all functions equal to the
function 0.

4. The class $O_4 = [\{\neg x\}]$. This class consists of all functions equal to
the functions $x$ or $\neg x$, and all functions obtained from them by renaming of
variables without identification.

5. The class $O_5 = [\{x, 1\}]$. This class consists of all functions equal to
the functions 1 or $x$, and all functions obtained from them by renaming of
variables without identification.

6. The class $O_6 = [\{x, 0\}]$. This class consists of all functions equal to
the functions 0 or $x$, and all functions obtained from them by renaming of
variables without identification.

7. The class $O_7 = [\{0, 1\}]$. This class consists of all functions equal to the
functions 0 or 1.

Fig. A.1 Inclusion diagram for closed classes of Boolean functions



8. The class $O_8 = [\{x, 0, 1\}]$. This class consists of all functions equal to
the functions 0, 1 or $x$, and all functions obtained from them by renaming of
variables without identification.

9. The class $O_9 = [\{\neg x, 0\}]$. This class consists of all functions equal to the
functions 0, 1, $\neg x$, or $x$, and all functions obtained from them by renaming
of variables without identification.

10. The class $S_1 = [\{x \vee y\}]$. This class consists of all disjunctions (i.e.,
functions of the form $\bigvee_{i=1}^{n} x_i$, $n = 1, 2, \ldots$, and all functions obtained from
them by renaming of variables without identification).

11. The class $S_3 = [\{x \vee y, 1\}]$. This class consists of all disjunctions and
all functions equal to 1.

12. The class $S_5 = [\{x \vee y, 0\}]$. This class consists of all disjunctions and
all functions equal to 0.

13. The class $S_6 = [\{x \vee y, 0, 1\}]$. This class consists of all disjunctions and
all functions equal to the functions 0 or 1.

14. The class $P_1 = [\{x \wedge y\}]$. This class consists of all conjunctions (i.e.,
functions of the form $\bigwedge_{i=1}^{n} x_i$, $n = 1, 2, \ldots$, and all functions obtained from
them by renaming of variables without identification).

15. The class $P_3 = [\{x \wedge y, 0\}]$. This class consists of all conjunctions and
all functions equal to 0.

16. The class $P_5 = [\{x \wedge y, 1\}]$. This class consists of all conjunctions and
all functions equal to 1.

17. The class $P_6 = [\{x \wedge y, 0, 1\}]$. This class consists of all conjunctions
and all functions equal to 0 or 1.
18. The class $L_1 = [\{x \oplus y, 1\}]$. This class consists of all linear functions.

19. The class $L_2 = [\{x \oplus y \oplus 1\}]$. This class consists of all linear $\alpha$-functions
and $\beta$-functions (i.e., functions of the form $\bigoplus_{i=1}^{2k} x_i \oplus 1$, $\bigoplus_{i=1}^{2l+1} x_i$,
$k, l = 0, 1, 2, \ldots$, and all functions obtained from them by renaming of variables
without identification).

20. The class $L_3 = [\{x \oplus y\}]$. This class consists of all linear $\alpha$-functions
and $\gamma$-functions (i.e., functions of the form $\bigoplus_{i=1}^{l} x_i$, $l = 0, 1, 2, \ldots$, and all
functions obtained from them by renaming of variables without
identification).

21. The class $L_4 = [\{x \oplus y \oplus z\}]$. This class consists of all linear $\alpha$-functions
(i.e., functions of the form $\bigoplus_{i=1}^{2l+1} x_i$, $l = 0, 1, 2, \ldots$, and all functions obtained
from them by renaming of variables without identification).

22. The class $L_5 = [\{x \oplus y \oplus z \oplus 1\}]$. This class consists of all linear
self-dual functions (i.e., functions of the form $\bigoplus_{i=1}^{2l+1} x_i \oplus 1$, $\bigoplus_{i=1}^{2l+1} x_i$,
$l = 0, 1, 2, \ldots$, and all functions obtained from them by renaming of variables
without identification).

23. The class $D_2 = [\{(x \wedge y) \vee (x \wedge z) \vee (y \wedge z)\}]$. This class consists of all
self-dual monotone functions.

24. The class $D_1 = [\{(x \wedge y) \vee (x \wedge \neg z) \vee (y \wedge \neg z)\}]$. This class consists of
all self-dual $\alpha$-functions.

25. The class $D_3 = [\{(x \wedge \neg y) \vee (x \wedge \neg z) \vee (\neg y \wedge \neg z)\}]$. This class consists
of all self-dual functions.

26. The class $A_1 = M_1 = [\{x \wedge y, x \vee y, 0, 1\}]$. This class consists of all
monotone functions.

27. The class $A_2 = M_2 = [\{x \wedge y, x \vee y, 1\}]$. This class consists of all
monotone $\alpha$-functions and $\beta$-functions.

28. The class $A_3 = M_3 = [\{x \wedge y, x \vee y, 0\}]$. This class consists of all
monotone $\alpha$-functions and $\gamma$-functions.

29. The class $A_4 = M_4 = [\{x \wedge y, x \vee y\}]$. This class consists of all monotone
$\alpha$-functions.

30. The class $C_1 = [\{\neg(x \wedge y)\}]$. This class consists of all Boolean functions.

31. The class $C_2 = [\{x \vee y, x \oplus y \oplus 1\}]$. This class consists of all $\alpha$-functions
and $\beta$-functions.

32. The class $C_3 = [\{x \wedge y, x \oplus y\}]$. This class consists of all $\alpha$-functions
and $\gamma$-functions.

33. The class $C_4 = [\{x \vee y, x \wedge (y \oplus z \oplus 1)\}]$. This class consists of all
$\alpha$-functions.

34. The class $F_1^\mu = [\{x \vee (y \wedge \neg z), h_\mu^*\}]$, $\mu = 2, 3, \ldots$. This class consists
of all $\alpha$-functions satisfying the condition $\langle a^\mu \rangle$.

35. The class $F_2^\mu$, $\mu = 2, 3, \ldots$, where $F_2^\mu = [\{x \vee (y \wedge z), h_2^*\}]$ if $\mu = 2$,
and $F_2^\mu = [\{h_\mu^*\}]$ if $\mu \ge 3$. This class consists of all monotone $\alpha$-functions
satisfying the condition $\langle a^\mu \rangle$.

36. The class $F_3^\mu = [\{1, h_\mu^*\}]$, $\mu = 2, 3, \ldots$. This class consists of all
monotone functions satisfying the condition $\langle a^\mu \rangle$.

37. The class $F_4^\mu = [\{x \vee \neg y, h_\mu^*\}]$, $\mu = 2, 3, \ldots$. This class consists of all
functions satisfying the condition $\langle a^\mu \rangle$.

38. The class $F_5^\mu = [\{x \wedge (y \vee \neg z), h_\mu\}]$, $\mu = 2, 3, \ldots$. This class consists
of all $\alpha$-functions satisfying the condition $\langle A^\mu \rangle$.

39. The class $F_6^\mu$, $\mu = 2, 3, \ldots$, where $F_6^\mu = [\{x \wedge (y \vee z), h_2\}]$ if $\mu = 2$,
and $F_6^\mu = [\{h_\mu\}]$ if $\mu \ge 3$. This class consists of all monotone $\alpha$-functions
satisfying the condition $\langle A^\mu \rangle$.

40. The class $F_7^\mu = [\{0, h_\mu\}]$, $\mu = 2, 3, \ldots$. This class consists of all
monotone functions satisfying the condition $\langle A^\mu \rangle$.

41. The class $F_8^\mu = [\{x \wedge \neg y, h_\mu\}]$, $\mu = 2, 3, \ldots$. This class consists of all
functions satisfying the condition $\langle A^\mu \rangle$.

42. The class $F_1^\infty = [\{x \vee (y \wedge \neg z)\}]$. This class consists of all $\alpha$-functions
satisfying the condition $\langle a^\infty \rangle$.

43. The class $F_2^\infty = [\{x \vee (y \wedge z)\}]$. This class consists of all monotone
$\alpha$-functions satisfying the condition $\langle a^\infty \rangle$.

44. The class $F_3^\infty = [\{1, x \vee (y \wedge z)\}]$. This class consists of all monotone
functions satisfying the condition $\langle a^\infty \rangle$.

45. The class $F_4^\infty = [\{x \vee \neg y\}]$. This class consists of all functions satisfying
the condition $\langle a^\infty \rangle$.

46. The class $F_5^\infty = [\{x \wedge (y \vee \neg z)\}]$. This class consists of all $\alpha$-functions
satisfying the condition $\langle A^\infty \rangle$.

47. The class $F_6^\infty = [\{x \wedge (y \vee z)\}]$. This class consists of all monotone
$\alpha$-functions satisfying the condition $\langle A^\infty \rangle$.

48. The class $F_7^\infty = [\{0, x \wedge (y \vee z)\}]$. This class consists of all monotone
functions satisfying the condition $\langle A^\infty \rangle$.

49. The class $F_8^\infty = [\{x \wedge \neg y\}]$. This class consists of all functions satisfying
the condition $\langle A^\infty \rangle$.